Foreword



Algorithmic learning theory is mathematics about computer programs which 
learn from experience. This involves considerable interaction between various 
mathematical disciplines including theory of computation, statistics, and com- 
binatorics. There is also considerable interaction with the practical, empirical 
fields of machine and statistical learning in which a principal aim is to predict, 
from past data about phenomena, useful features of future data from the same 
phenomena. 

The papers in this volume cover a broad range of topics of current research 
in the field of algorithmic learning theory. We have divided the 29 technical, 
contributed papers in this volume into eight categories (corresponding to eight 
sessions) reflecting this broad range. The categories featured are Inductive
Inference, Approximate Optimization Algorithms, Online Sequence Prediction,
Statistical Analysis of Unlabeled Data, PAC Learning & Boosting, Statistical
Supervised Learning, Logic Based Learning, and Query & Reinforcement Learning.

Below we give a brief overview of the field, placing each of these topics in the 
general context of the field. Formal models of automated learning reflect various 
facets of the wide range of activities that can be viewed as learning. 

A first dichotomy is between viewing learning as an indefinite process and 
viewing it as a finite activity with a defined termination. Inductive Inference 
models focus on indefinite learning processes, requiring only eventual success of 
the learner to converge to a satisfactory conclusion. 

When one wishes to predict future data, success can be enhanced by making 
some restrictive but true assumptions about the nature (or regularities) of the 
data stream. In the learning theory community, this problem is addressed in 
two different ways. The first is by assuming that the data to be predicted is 
generated by an operator that belongs to a restricted set of operators that is 
known to the learner a priori. The PAC model and some of the work under the
Inductive Inference framework follow this path. Alternatively, one could manage 
without any such prior assumptions by relaxing the success requirements of the 
learner: rather than opting for some absolute degree of accuracy, the learner is 
only required to perform as well as any learner in some fixed family of learners. 
Thus, if the data is erratic or otherwise hard to predict, the learner can ignore 
poor accuracy as long as no member of the fixed reference family of learners can
do better. This is the approach taken by some Online Sequence Prediction
models in the indefinite learning setting and, also, by most of the models of 
Statistical Learning in the finite horizon framework. 

Boosting is a general technique that applies a given type of learner iteratively 
to improve its performance. Boosting approaches have been shown to be effective 
for a wide range of learning algorithms and have been implemented by a variety 
of methods. 
A second dichotomy is between Supervised and Un-Supervised learning. The
latter we refer to as learning from Unlabeled Data. In the first scenario, the data
has the form of example-label pairs. The learner is trained on a set of such
pairs and then, upon seeing some fresh examples, has to predict their labels. 
In the latter model, the data points lack any such labeling, and the learner has 
to find some persistent regularities in the data, on the basis of the examples 
it has seen. Such regularities often take the form of partitioning the data into 
clusters of similar points, but in some cases take other forms, such as locating 
the boundaries of the support of the data generating distribution. 

Many learning algorithms can be viewed as searching for an object that fits 
the given training data best. Such optimization tasks are often computationally 
infeasible. To overcome such computational hurdles, it is useful to apply algo- 
rithms that search for approximations to the optimal objects. The study of such 
algorithms, in the context of learning tasks, is the subject of our Approximate 
Optimization Algorithms session. 

There is a large body of research that examines different representations of 
data and of learners’ conclusions. This research direction is the focus of our Logic 
Based Learning session. 

A final important dichotomy separates models of interactive learning from 
those that model passive learners. In the first type of learning scenarios the ac- 
tions of the learner affect the training data available to it. In the Query Learn- 
ing model this interaction takes the form of queries of certain (pre-indicated) 
type(s) that the learner can pose. Then the data upon which the learner bases 
its conclusions are the responses to these queries. The other model that addresses 
interactive learning is Reinforcement Learning, a model that assumes that the 
learner takes actions and receives rewards that are a function of these actions. 
These rewards in turn are used by the learner to determine its future actions. 
Preface



This volume contains the papers presented at the 15th Annual International 
Conference on Algorithmic Learning Theory (ALT 2004), which was held in 
Padova (Italy) October 2-5, 2004. The main objective of the conference was 
to provide an interdisciplinary forum for discussing the theoretical foundations 
of machine learning as well as their relevance to practical applications. The 
conference was co-located with the 7th International Conference on Discovery 
Science (DS 2004) and the 11th Conference on String Processing and Information 
Retrieval (SPIRE 2004) under the general title “The Padova Dialogues 2004”.

The volume includes 29 technical contributions that were selected by the 
program committee from 91 submissions. It also contains the invited lecture for 
ALT and DS 2004 presented by Ayumi Shinohara (Kyushu University, Fukuoka, 
Japan) on “String Pattern Discovery”. Furthermore, this volume contains the 
ALT 2004 invited talks presented by Nicolò Cesa-Bianchi (Università degli Studi
di Milano, Italy) on “Applications of Regularized Least Squares in Classification
Problems”, and by Luc De Raedt (Universität Freiburg, Germany) on
“Probabilistic Inductive Logic Programming”.

Additionally, it contains the invited lecture presented by Esko Ukkonen (Uni- 
versity of Helsinki, Finland) on “Hidden Markov Modelling Techniques for Hap- 
lotype Analysis” (joint invited talk with DS 2004). Moreover, this volume in- 
cludes the abstract of the joint invited lecture with DS 2004 presented by Pedro 
Domingos (University of Washington, Seattle, USA) on “Learning, Logic, and 
Probability: A Unified View”. 

Finally, this volume contains the papers of the research tutorials on Statistical 
Mechanical Methods in Learning by Toshiyuki Tanaka (Tokyo Metropolitan Uni- 
versity, Japan) on “Statistical Learning in Digital Wireless Communications”, 
by Yoshiyuki Kabashima and Shinsuke Uda (Tokyo Institute of Technology, 
Japan), on “A BP-Based Algorithm for Performing Bayesian Inference in Large 
Perceptron-like Networks”, and by Manfred Opper and Ole Winther on
“Approximate Inference in Probabilistic Models”.

ALT has been awarding the E. Mark Gold Award for the most outstanding 
paper by a student author since 1999. This year the award was given to Hubie 
Chen for his paper “Learnability of Relatively Quantified Generalized Formulas”,
co-authored by Andrei Bulatov and Victor Dalmau. 

This conference was the 15th in a series of annual conferences established 
in 1990. Continuation of the ALT series is supervised by its steering commit- 
tee consisting of: Thomas Zeugmann (Hokkaido Univ., Sapporo, Japan), Chair, 
Arun Sharma (Queensland Univ. of Technology, Australia), Co-chair, Naoki Abe 
(IBM T.J. Watson Research Center, USA), Klaus Peter Jantke (DFKI, Ger- 
many), Roni Khardon (Tufts Univ., USA), Phil Long (National Univ. of Sin- 
gapore), Hiroshi Motoda (Osaka University, Japan), Akira Maruoka (Tohoku 
Univ., Japan), Luc De Raedt (Albert-Ludwigs-Univ., Germany), Takeshi
Shinohara (Kyushu Institute of Technology, Japan), and Osamu Watanabe (Tokyo
Institute of Technology, Japan).

We would like to thank all individuals and institutions who contributed to 
the success of the conference: the authors for submitting papers, the invited 
speakers for accepting our invitation and lending us their insight into recent 
developments in their research areas, as well as the sponsors for their generous 
financial and logistical support. 

We would also like to thank Thomas Zeugmann for assisting us via his ex- 
perience in the publication of previous ALT proceedings, for providing the ALT 
2004 logo, and for managing the ALT 2004 Web site. We are very grateful to 
Frank Balbach who developed the ALT 2004 electronic submission page. 

Furthermore, we would like to express our gratitude to all program committee 
members for their hard work in reviewing the submitted papers and participating 
in online discussions. We are also grateful to the external referees whose reviews 
made a considerable contribution to this process. 

We are also grateful to the DS 2004 chairs Einoshin Suzuki (PC Chair, 
Yokohama National University, Japan) and Setsuo Arikawa (Conference Chair, 
Kyushu University, Japan) for their effort in coordinating with ALT 2004, and 
to Massimo Melucci (University of Padova, Italy) for his excellent work as the 
local arrangements chair. Last but not least, Springer provided excellent support
in preparing this volume. 
Remembering Carl Smith, 1950–2004



Sadly, Carl Smith passed away at 10:30 PM on July 21, 2004. He had had a 1.5-year
battle with an aggressive brain tumor. He fought this battle with calm optimism, 
dignity, and grace. He is survived by his wife, Patricia, his son, Austin, and his 
sister, Karen Martin. 

Carl was very active in the algorithmic or computational learning communi- 
ties, especially in the inductive inference subarea which applies recursive function 
theory techniques. 

I first met Carl when I interviewed for my faculty position at SUNY/Buffalo
in the Spring of 1973. He was then a graduate student there and told me he 
was interested in recursive function theory. After I joined there, he naturally 
became my Ph.D. student, and that’s when we both began working on inductive 
inference. We spent a lot of time together, pleasantly blurring the distinction 
between the relationships of friendship and advisor-student. 

After Buffalo, Carl had faculty positions at Purdue and, then, the University 
of Maryland. 

Carl had a very productive career. He was a master collaborator working 
with many teams around the world. Of course he also produced a number of 
papers about inductive inference by teams — as well as papers about anomalies, 
queries, memory limitation, procrastination, and measuring mind changes by 
counting down from notations for ordinals. I had the reaction to some of his 
papers of wishing I’d thought of the idea. This especially struck me with his 
1989 TCS paper (with Angluin and Gasarch) in which it is elegantly shown that 
the learning of some classes of tasks can be done only sequentially after or in 
parallel with other classes. 

Carl played a significant leadership role in theoretical computer science. In 
1981, with the help of Paul Young, Carl organized the Workshop on Recursion 
Theoretic Aspects of Computer Science. This became the well known, continuing 
series of Computational Complexity conferences. Carl provided an improvement 
in general theoretical computer science funding level during his year as Theory 
Program Director at NSF. He was involved, in many cases from the beginning, in 
the COLT, AII, ALT, EuroCOLT, and DS conferences, as a presenter of papers,
as a member of many of their program committees and, in some cases, steering 
committees. He spearheaded the development of COLT’s Mark Fulk Award for 
best student papers and managed the finances. 

Carl was very likable. He had a knack for finding funding to make good things 
happen. He was a good friend and colleague. He is missed. 
Abstract. Finding a good pattern which discriminates one set of strings
from another is a critical task in knowledge discovery. In this paper,
we review a series of our works concerning string pattern discovery.
It includes theoretical analyses of the learnability of some pattern
classes, as well as the development of practical data structures which support
efficient string processing.



1 Introduction 

A huge amount of text data and sequential data is accessible these days.
In particular, the growing popularity of the Internet has caused an enormous
increase of text data in the last decade. Moreover, a lot of biological sequences
are also available due to various genome sequencing projects. Many of these data
are stored as raw strings, or in semi-structured forms such as HTML and XML,
which are essentially strings. String pattern discovery, where one is interested
in extracting patterns which characterize a set of strings or sequential data,
has attracted widespread attention [1,36,13,24,12,3,4,30]. Discovering a good
rule to separate two given sets, often referred to as positive examples and negative
examples, is a critical task in Machine Learning and Knowledge Discovery. In this
paper, we review a series of our works on finding the best string patterns efficiently,
together with their theoretical background.

Our motivation originated in the development of the machine discovery system
BONSAI [31], which produces a decision tree over regular patterns with alphabet
indexing from given positive and negative sets of strings. The core part of
the system is to generate a decision tree which classifies positive examples and 
negative examples as correctly as possible. For that purpose, we have to find 
a pattern that maximizes the goodness according to the entropy information 
gain measure, recursively at each node of the tree. In the initial implementation,
a pattern associated with each node was restricted to a substring pattern, due to
the limit of computation time. In order to allow more expressive patterns while
keeping the computation time reasonable, we have gradually introduced various
techniques [15,17,16,7,19,8,21,32,6,18]. Essentially, they are combinations
of pruning heuristics for the huge search space that do not sacrifice the
optimality of the solution, and efficient data structures which support various
string processing operations.
In this paper, we describe the most fundamental ideas of these works, by
focusing only on substring patterns, subsequence patterns, and episode patterns
as the target patterns to be discovered. We also show their learnability in the
probably approximately correct (PAC) learning model, which is the background
theory of our approach.

2 Preliminaries 

For a finite alphabet $\Sigma$, let $\Sigma^*$ be the set of all strings over $\Sigma$. For a string $w$,
we denote by $|w|$ the length of $w$. For a set $S \subseteq \Sigma^*$ of strings, we denote by $|S|$
the number of strings in $S$, and by $\|S\|$ the total length of the strings in $S$. Let $\mathcal{N}$
be the set of natural numbers.

We say that a string $v$ is a prefix (substring, suffix, resp.) of $w$ if $w = vy$
($w = xvy$, $w = xv$, resp.) for some strings $x, y \in \Sigma^*$. We say that a string $v$
is a subsequence of a string $w$ if $v$ can be obtained by removing zero or more
characters from $w$, and say that $w$ is a supersequence of $v$.

A pattern class is a pair $C = (\Pi, m)$, where $\Pi$ is a set called the pattern set and
$m : \Sigma^* \times \Pi \to \{0, 1\}$ is the pattern matching function. An element $p \in \Pi$ is called
a pattern. For a pattern $p \in \Pi$ and a string $w \in \Sigma^*$, we say that $p$ of class $C$ matches $w$
iff $m(w, p) = 1$. For a pattern $p$, we denote by $L_C(p) = \{w \in \Sigma^* \mid m(w, p) = 1\}$
the set of strings in $\Sigma^*$ which $p$ of class $C$ matches.

In this paper, we focus on the following pattern classes, and their languages. 



Definition 1. The substring pattern class is defined to be a pair $(\Sigma^*, substr)$
where $substr(w, p) = 1$ iff $p$ is a substring of $w$. The subsequence pattern class
is a pair $(\Sigma^*, subseq)$ where $subseq(w, p) = 1$ iff $p$ is a subsequence of $w$. The
episode pattern class is a pair $(\Sigma^* \times \mathcal{N}, epis)$ where $epis(w, (p, k)) = 1$ iff $p$ is a
subsequence of some substring $v$ of $w$ such that $|v| \le k$.

Finally, the substring (subsequence, episode, resp.) pattern language, denoted
by $L^{str}(p)$ ($L^{seq}(p)$, $L^{eps}((p, k))$, resp.), is the language defined by the substring
(subsequence, episode, resp.) pattern class.

Note that the substring pattern languages form a subclass of the episode
pattern languages, since $L^{str}(p) = L^{eps}((p, |p|))$ for any $p \in \Sigma^*$. The subsequence
pattern languages also form a subclass of the episode pattern languages, since
$L^{seq}(p) = L^{eps}((p, \infty))$ for any $p \in \Sigma^*$.
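The three matching functions of Definition 1 can be sketched directly from their definitions. The following Python sketch is an illustration only: the function names mirror $substr$, $subseq$, and $epis$, but the implementation choices (greedy scanning, `float("inf")` standing in for $\infty$) are assumptions made here, not the data structures used later in the paper.

```python
def substr(w: str, p: str) -> bool:
    """True iff p is a substring of w."""
    return p in w


def subseq(w: str, p: str) -> bool:
    """True iff p is a subsequence of w (greedy left-to-right scan)."""
    it = iter(w)
    return all(c in it for c in p)


def epis(w: str, p: str, k: float) -> bool:
    """True iff p is a subsequence of some substring v of w with |v| <= k."""
    if not p:
        return True  # the empty pattern fits into the empty substring
    for i, c in enumerate(w):
        if c != p[0]:
            continue
        # Match the rest of p as early as possible, starting right after i.
        j = i
        for ch in p[1:]:
            j = w.find(ch, j + 1)
            if j < 0:
                break
        else:
            # w[i..j] is the shortest window starting at i that contains p.
            if j - i + 1 <= k:
                return True
    return False
```

With these definitions, the two identities in the remark above can be checked on small inputs: `substr(w, p)` agrees with `epis(w, p, len(p))`, and `subseq(w, p)` agrees with `epis(w, p, float("inf"))`.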

3 PAC-Learnability 

Valiant [35] introduced the PAC-learning model as a formal model of concept
learning from examples. An excellent textbook on this topic is written by Kearns 
and Vazirani [23]. This section briefly summarizes the PAC-learnability of the 
languages we defined in the last section. 

For a pattern class $C = (\Pi, m)$, a pair $(w, m(w, p))$ is called an example of a
pattern $p \in \Pi$ for $w \in \Sigma^*$. It is called a positive example if $m(w, p) = 1$, and is
called a negative example otherwise. For an alphabet $\Sigma$ and an integer $n \ge 0$,
we denote by $\Sigma^{[n]}$ the set $\{w \in \Sigma^* : |w| \le n\}$.

Definition 2. A pattern class $C = (\Pi, m)$ is polynomial-time learnable if there
exist an algorithm $A$ and a polynomial $poly(\cdot, \cdot, \cdot)$ which satisfy the following
conditions for any pattern $p \in \Pi$, any real numbers $\varepsilon, \delta$ ($0 < \varepsilon, \delta < 1$), any
integer $n \ge 0$, and any probability distribution $P$ on $\Sigma^{[n]}$:

(a) $A$ takes $\varepsilon$, $\delta$, and $n$ as inputs.

(b) $A$ may call EXAMPLE, which generates examples of the pattern $p \in \Pi$
randomly according to the probability distribution $P$ on $\Sigma^{[n]}$.

(c) $A$ outputs a pattern $q \in \Pi$ satisfying $P(L_C(p) \cup L_C(q) - L_C(p) \cap L_C(q)) < \varepsilon$
with probability at least $1 - \delta$.

(d) The running time of $A$ is bounded by $poly(1/\varepsilon, 1/\delta, n)$.

The PAC-learnability is well-characterized in terms of Vapnik-Chervonenkis 
dimension [10] and the existence of Occam algorithm [9,11]. 

Theorem 1 ([10,29]). A pattern class $C = (\Pi, m)$ is polynomial-time learnable
if the following conditions hold.

(a) $C$ is of polynomial dimension, i.e., there exists a polynomial $d(n)$ such that
$|\{L_C(p) \cap \Sigma^{[n]} : p \in \Pi\}| \le 2^{d(n)}$.

(b) There exists a polynomial-time algorithm, called a polynomial-time hypothesis
finder for $C$, which produces a hypothesis from a sequence of examples such
that it is consistent with the given examples.

On the other hand, the pattern class $C$ is polynomial-time learnable only if there
exists a randomized polynomial-time hypothesis finder for $C$.

It is not hard to verify that the substring languages, subsequence languages,
and episode pattern languages are all of polynomial dimension. Therefore the
problem is whether there exists a (randomized) polynomial-time hypothesis finder
for these classes. For substring languages, we can easily construct a
polynomial-time hypothesis finder, since every candidate pattern must be a substring
of each positive example, so that we have only to consider a quadratic number of
candidates. Indeed, we can develop a linear-time hypothesis finder [15] by
utilizing the data structure called the Generalized Suffix Tree. See also our recent
generalization of it for finding pairs of substring patterns [6]. On the other hand,
the consistency problem for subsequence languages is NP-hard [26,22,27].
Thus we have the following theorem.

Theorem 2. The substring languages are polynomial-time PAC-learnable. On the
other hand, neither subsequence languages nor episode languages are polynomial-time
PAC-learnable under the assumption $RP \ne NP$.
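The quadratic candidate enumeration for substring languages mentioned before Theorem 2 can be sketched as follows. This is a naive illustration, not the linear-time finder of [15]: the function name `consistent_substring` and the representation of examples as `(string, label)` pairs are assumptions made here, and the sketch only handles samples containing at least one positive example.

```python
from typing import List, Optional, Tuple


def consistent_substring(examples: List[Tuple[str, int]]) -> Optional[str]:
    """Return a substring pattern consistent with the labeled examples, or None.

    examples: pairs (w, label) with label 1 for positive, 0 for negative.
    """
    pos = [w for w, label in examples if label == 1]
    neg = [w for w, label in examples if label == 0]
    if not pos:
        return None  # assumption: this sketch requires a positive example
    # Every consistent pattern is a substring of each positive example, so it
    # suffices to try the O(n^2) substrings of the shortest positive example.
    base = min(pos, key=len)
    for i in range(len(base) + 1):
        for j in range(i, len(base) + 1):
            p = base[i:j]
            if all(p in w for w in pos) and not any(p in w for w in neg):
                return p
    return None
```

For instance, for the positive examples abab and aab and the negative examples bb and bba, the sketch returns the pattern ab.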

The query learning model due to Angluin [2] and identification in the limit due
to Gold [14] are also important learning models. In [25], we discussed the
learnability of finite unions of subsequence languages in these models.
4 Finding Best Patterns Efficiently

From a practical viewpoint, we have to find a good pattern which discriminates
positive examples from negative examples. We formulate the problem by following
our paper [15]. Let $good$ be a function from $\Sigma^* \times 2^{\Sigma^*} \times 2^{\Sigma^*}$ to the set of
real numbers. We formulate the problem of finding the best pattern according
to the function $good$ as follows.

Definition 3 (Finding the best pattern in $C = (\Pi, m)$ according to $good$).

Input: Two sets $S, T \subseteq \Sigma^*$ of strings.

Output: A pattern $p \in \Pi$ that maximizes the value $good(p, S, T)$.

Intuitively, the value $good(p, S, T)$ expresses the goodness to distinguish $S$ from
$T$ using the rule specified by the pattern $p$. We may choose an appropriate
function $good$ for each application. For example, the $\chi^2$ value, entropy
information gain, and gini index are frequently used. Essentially these statistical
measures are defined by the numbers of strings that satisfy the rule specified by
$p$. Thus we can describe the measure in the following form:

$good(p, S, T) = f(x_p, y_p, |S|, |T|)$,

where $x_p = |S \cap L_C(p)|$ and $y_p = |T \cap L_C(p)|$. When the sets $S$ and $T$ are fixed,
the values $x_{max} = |S|$ and $y_{max} = |T|$ are unchanged. Thus we abbreviate the
function $f(x, y, x_{max}, y_{max})$ to $f(x, y)$ in the sequel.

Since the function $good(p, S, T)$ expresses the goodness of a pattern $p \in \Pi$
to distinguish two sets, it is natural to assume that the function $f$ satisfies
conicality, defined as follows.

Definition 4. We say that a function $f$ from $[0, x_{max}] \times [0, y_{max}]$ to real numbers
is conic if

– for any $0 \le y \le y_{max}$, there exists an $x_1$ such that

• $f(x, y) \ge f(x', y)$ for any $0 \le x < x' \le x_1$, and

• $f(x, y) \le f(x', y)$ for any $x_1 \le x < x' \le x_{max}$.

– for any $0 \le x \le x_{max}$, there exists a $y_1$ such that

• $f(x, y) \ge f(x, y')$ for any $0 \le y < y' \le y_1$, and

• $f(x, y) \le f(x, y')$ for any $y_1 \le y < y' \le y_{max}$.

Actually, all of the above statistical measures are conic. We remark that any
convex function is conic. We assume that $f$ is conic and can be evaluated in
constant time in the sequel.
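As a concrete instance of such a function $f$, the following Python sketch computes the entropy information gain from the four counts $x$, $y$, $x_{max}$, $y_{max}$, and checks the conic shape numerically along each axis. The names `entropy`, `info_gain`, and `is_valley` are hypothetical helpers introduced here for illustration; the sketch assumes that $S$ and $T$ are not both empty.

```python
import math
from typing import Sequence


def entropy(a: float, b: float) -> float:
    """Binary entropy (in bits) of a group with a positive and b negative strings."""
    n = a + b
    if n == 0:
        return 0.0
    h = 0.0
    for c in (a, b):
        if c > 0:
            h -= (c / n) * math.log2(c / n)
    return h


def info_gain(x: int, y: int, x_max: int, y_max: int) -> float:
    """f(x, y, x_max, y_max): gain of splitting S and T by 'the pattern matches'."""
    n = x_max + y_max
    matched, rest = x + y, n - (x + y)
    after = (matched / n) * entropy(x, y) + (rest / n) * entropy(x_max - x, y_max - y)
    return entropy(x_max, y_max) - after


def is_valley(values: Sequence[float], eps: float = 1e-12) -> bool:
    """True iff the sequence never decreases again once it has started to increase."""
    rising = False
    for prev, cur in zip(values, values[1:]):
        if cur > prev + eps:
            rising = True
        elif cur < prev - eps and rising:
            return False
    return True
```

A pattern matching all of $S$ and none of $T$ attains the largest gain, while a pattern matching nothing (or everything) has gain zero; for fixed $y$ the values over $x$ first fall and then rise, which is the shape required by Definition 4.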

We now describe the basic idea of our algorithms. Fig. 1 shows a naive
algorithm which exhaustively searches all possible patterns one by one, and returns
the best pattern that gives the maximum score. Since the most time-consuming part
is obviously lines 5 and 6, in order to reduce the search time, we should (1)
dynamically reduce the set of possible patterns in line 4 by using appropriate
pruning heuristics, and (2) speed up the computation of $|S \cap L_C(p)|$ and $|T \cap L_C(p)|$
for each pattern $p$. We deal with (1) in the next section, and we treat (2) in
Section 7.
 1 pattern FindBestPattern(StringSet S, T)
 2   double maxVal = -∞;
 3   pattern maxPat = null;
 4   for all possible patterns p ∈ Π do
 5     x = |S ∩ L_C(p)|;
 6     y = |T ∩ L_C(p)|;
 7     val = f(x, y);
 8     if val > maxVal then
 9       maxVal = val;
10       maxPat = p;
11   return maxPat;



Fig. 1. Exhaustive search algorithm for finding the best pattern in $C = (\Pi, m)$
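The exhaustive search of Fig. 1 can be made concrete for the substring pattern class. In the Python sketch below, the candidate set is restricted to the substrings occurring in $S \cup T$ (an assumption made here: every other pattern matches nothing and has the score $f(0, 0)$), and the score function is passed in by the caller. The names `find_best_pattern` and `all_substrings` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def find_best_pattern(
    S: Sequence[str],
    T: Sequence[str],
    patterns: Iterable[str],
    matches: Callable[[str, str], bool],
    f: Callable[[int, int], float],
) -> Tuple[Optional[str], float]:
    """Exhaustive search: score every candidate pattern and keep the best one."""
    max_val = float("-inf")
    max_pat = None
    for p in patterns:
        x = sum(1 for w in S if matches(w, p))  # |S ∩ L_C(p)|
        y = sum(1 for w in T if matches(w, p))  # |T ∩ L_C(p)|
        val = f(x, y)
        if val > max_val:
            max_val, max_pat = val, p
    return max_pat, max_val


def all_substrings(strings: Iterable[str]) -> List[str]:
    """All non-empty substrings of the given strings, in sorted order."""
    return sorted(
        {w[i:j] for w in strings for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    )
```

A usage example, with a simple convex (hence conic) score chosen only for illustration:

```python
S, T = ["abab", "aab"], ["bb", "bba"]
score = lambda x, y: abs(x / len(S) - y / len(T))
best = find_best_pattern(S, T, all_substrings(S + T), lambda w, p: p in w, score)
```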



5 Pruning Heuristics 

In this section, we introduce some pruning heuristics, inspired by Morishita and 
Sese [28], to construct a practical algorithm to find the best subsequence pattern 
and the best episode pattern, without sacrificing the optimality of the solution. 

Lemma 1 ([15]). For any patterns $p, q \in \Pi$ with $L_C(p) \supseteq L_C(q)$, we have

$f(x_q, y_q) \le \max\{f(x_p, y_p), f(x_p, 0), f(0, y_p), f(0, 0)\}$.

Lemma 2 ([15,16]). For any subsequence patterns $p, q \in \Sigma^*$ such that $p$ is
a subsequence of $q$, we have $L^{seq}(p) \supseteq L^{seq}(q)$. Moreover, for $l \ge k$, we have
$L^{eps}((p, l)) \supseteq L^{eps}((q, k))$.

In Fig. 2, we show our algorithm to find the best subsequence pattern from
two given sets of strings, according to the function $f$. Optionally, we can specify
the maximum length of subsequences. We use the following data structures in
the algorithm.

StringSet Maintain a set S of strings.

– int numOfSubseq(string p) : return the cardinality of the set {w ∈ S |
p is a subsequence of w}.

PriorityQueue Maintain strings with their priorities.

– bool empty() : return true if the queue is empty.

– void push(string w, double priority) : push a string w into the queue with
priority priority.

– (string, double) pop() : pop and return a pair (string, priority), where
priority is the highest in the queue.

The next theorem guarantees the completeness of the algorithm. 

Theorem 3 ([15]). Let $S$ and $T$ be sets of strings, and $\ell$ be a positive integer.
The algorithm FindMaxSubsequence(S, T, $\ell$) will return a string $w$ that
maximizes the value $good(w, S, T)$ among the strings of length at most $\ell$.
 1 string FindMaxSubsequence(StringSet S, T, int maxLength = ∞)
 2   string prefix, seq, maxSeq;
 3   double upperBound = ∞, maxVal = -∞, val;
 4   int x, y;
 5   PriorityQueue queue; /* Best First Search */
 6   queue.push("", ∞);
 7   while not queue.empty() do
 8     (prefix, upperBound) = queue.pop();
 9     if upperBound < maxVal then break;
10     foreach c ∈ Σ do
11       seq = prefix + c; /* string concatenation */
12       x = S.numOfSubseq(seq);
13       y = T.numOfSubseq(seq);
14       val = f(x, y);
15       if val > maxVal then
16         maxVal = val;
17         maxSeq = seq;
18       upperBound = max{f(x, y), f(x, 0), f(0, y), f(0, 0)};
19       if |seq| < maxLength then
20         queue.push(seq, upperBound);
21   return maxSeq;



Fig. 2. Algorithm FindMaxSubsequence. 
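A runnable transcription of Fig. 2 might look as follows. This Python sketch is an illustration under several assumptions: `heapq` (a min-heap, so upper bounds are pushed negated) replaces the PriorityQueue, the counting is done by a naive scan instead of the data structures of Section 7, the maximum length is a required finite argument so that the loop always terminates, and the function names are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def is_subseq(p: str, w: str) -> bool:
    """True iff p is a subsequence of w."""
    it = iter(w)
    return all(c in it for c in p)


def num_of_subseq(strings: Sequence[str], p: str) -> int:
    """Number of strings that have p as a subsequence (naive counting)."""
    return sum(1 for w in strings if is_subseq(p, w))


def find_max_subsequence(
    S: Sequence[str],
    T: Sequence[str],
    alphabet: str,
    f: Callable[[int, int], float],
    max_length: int,
) -> Tuple[str, float]:
    """Best-first search over subsequence patterns with the bound of Lemma 1."""
    max_val, max_seq = float("-inf"), ""
    queue: List[Tuple[float, str]] = [(-float("inf"), "")]  # (negated bound, prefix)
    while queue:
        neg_bound, prefix = heapq.heappop(queue)
        if -neg_bound < max_val:
            break  # no remaining prefix can lead to a better pattern
        for c in alphabet:
            seq = prefix + c
            x = num_of_subseq(S, seq)
            y = num_of_subseq(T, seq)
            val = f(x, y)
            if val > max_val:
                max_val, max_seq = val, seq
            # By Lemmas 1 and 2, no extension of seq can score above this bound.
            upper = max(f(x, y), f(x, 0), f(0, y), f(0, 0))
            if len(seq) < max_length:
                heapq.heappush(queue, (-upper, seq))
    return max_seq, max_val
```

The pruning is sound only for conic $f$; with a non-conic score the early `break` could discard the optimum.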



6 Finding Best Threshold Values 

We now show a practical algorithm to find the best episode patterns. We should
remark that the search space of episode patterns is $\Sigma^* \times \mathcal{N}$, while the search
space of subsequence patterns was $\Sigma^*$. A straightforward approach based on
the last section might be as follows. First we observe that the algorithm
FindMaxSubsequence in Fig. 2 can be easily modified to find the best episode
pattern $(v, k)$ for any fixed threshold $k$: we have only to replace lines 12
and 13 so that they compute the numbers of strings in $S$ and $T$ that match
the episode pattern $(seq, k)$, respectively. Thus, for each possible threshold value
$k$, we repeat this algorithm and take the maximum. A short consideration reveals
that we have only to consider the threshold values up to $l$, that is, the length of
the longest string in the given $S$ and $T$.

However, here we give a more efficient solution. We consider the following 
problem, that is a subproblem of finding the best episode pattern. 

Definition 5 (Finding the best threshold value).

Input: Two sets $S, T \subseteq \Sigma^*$ of strings, and a string $v \in \Sigma^*$.

Output: Integer $k$ that maximizes the value $f(x_{(v,k)}, y_{(v,k)})$, where
$x_{(v,k)} = |S \cap L^{eps}((v, k))|$ and $y_{(v,k)} = |T \cap L^{eps}((v, k))|$.
For strings $v, s \in \Sigma^*$, we define the threshold value $\theta$ of $v$ for $s$ by
$\theta = \min\{k \in \mathcal{N} \mid s \in L^{eps}((v, k))\}$. If there is no such value, let $\theta = \infty$. Note that
$s \notin L^{eps}((v, k))$ for any $k < \theta$, and $s \in L^{eps}((v, k))$ for any $k \ge \theta$. For a set $S$ of
strings and a string $v$, let us denote by $\Theta_{S,v}$ the set of threshold values of $v$ for some $s \in S$.

A key observation is that a best threshold value for given $S, T \subseteq \Sigma^*$ and a
string $v \in \Sigma^*$ can be found in $\Theta_{S,v} \cup \Theta_{T,v}$ without loss of generality. Thus we
can restrict the search space of the best threshold values to $\Theta_{S,v} \cup \Theta_{T,v}$.

From now on, we consider the numerical sequence $\{x_{(v,k)}\}_{k=0}^{\infty}$. (We treat
$\{y_{(v,k)}\}_{k=0}^{\infty}$ in the same way.) It follows from Lemma 2 that the sequence is
non-decreasing. Moreover, remark that $0 \le x_{(v,k)} \le |S|$ for any $k$. Moreover,
$x_{(v,l)} = x_{(v,l+1)} = x_{(v,l+2)} = \cdots = x_{(v,\infty)}$, where $l$ is the length of the longest string in
$S$. Hence, we can represent $\{x_{(v,k)}\}_{k=0}^{\infty}$ with a list having at most $\min\{|S|, l\}$
elements. We call this list a compact representation of the sequence $\{x_{(v,k)}\}_{k=0}^{\infty}$
(CRS, for short).

We show how to compute the CRS for each $v$ and a fixed $S$. Observe that $x_{(v,k)}$
increases only at the threshold values of $v$ for some $s \in S$. By computing a
sorted list of the threshold values of $v$ for all $s \in S$, we can construct the CRS of
$\{x_{(v,k)}\}_{k=0}^{\infty}$. Using counting sort, we can compute the CRS for $v \in \Sigma^*$ in
$O(|S|ml + |S|) = O(\|S\|m)$ time, where $m = |v|$.

We emphasize that the time complexity of computing the CRS of $\{x_{(v,k)}\}_{k=0}^{\infty}$
is the same as that of computing $x_{(v,k)}$ for a single $k$ ($0 \le k \le \infty$), by our method.

After constructing the CRSs x of $\{x_{(v,k)}\}_{k=0}^{\infty}$ and y of $\{y_{(v,k)}\}_{k=0}^{\infty}$, we can
compute the best threshold value in $O(|x| + |y|)$ time. Thus we have the following,
which gives an efficient solution to the problem of finding the best threshold value.

Lemma 3. Given $S, T \subseteq \Sigma^*$ and $v \in \Sigma^*$, we can find the best threshold
value in $O((\|S\| + \|T\|) \cdot |v|)$ time, where $\|S\|$ and $\|T\|$ represent the total length
of the strings in $S$ and $T$, respectively.
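The threshold values, the compact representation, and the linear merge behind Lemma 3 can be sketched as below. This is an illustrative Python sketch with assumptions: the CRS is stored as a sorted list of `(threshold, cumulative count)` pairs, the threshold of a single string is computed by a naive scan rather than within the time bound of the lemma, and the function names are hypothetical. If $v$ matches no string at all, `best_threshold` returns `(None, -inf)`.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def threshold(v: str, s: str) -> float:
    """Smallest k such that v is a subsequence of a length-k substring of s."""
    if not v:
        return 0
    best = math.inf
    for i, c in enumerate(s):
        if c != v[0]:
            continue
        j = i
        for ch in v[1:]:
            j = s.find(ch, j + 1)
            if j < 0:
                break
        else:
            best = min(best, j - i + 1)
    return best


def crs(strings: Sequence[str], v: str) -> List[Tuple[int, int]]:
    """Compact representation: (theta, number of strings with threshold <= theta)."""
    counts: Dict[int, int] = {}
    for s in strings:
        t = threshold(v, s)
        if t != math.inf:
            counts[t] = counts.get(t, 0) + 1
    out, total = [], 0
    for t in sorted(counts):
        total += counts[t]
        out.append((t, total))
    return out


def best_threshold(
    xs: List[Tuple[int, int]], ys: List[Tuple[int, int]], f: Callable[[int, int], float]
) -> Tuple[Optional[int], float]:
    """Merge two CRS lists and evaluate f once per candidate threshold."""
    i = j = x = y = 0
    best_k, best_val = None, -math.inf
    while i < len(xs) or j < len(ys):
        k = min(xs[i][0] if i < len(xs) else math.inf,
                ys[j][0] if j < len(ys) else math.inf)
        if i < len(xs) and xs[i][0] == k:
            x, i = xs[i][1], i + 1
        if j < len(ys) and ys[j][0] == k:
            y, j = ys[j][1], j + 1
        val = f(x, y)
        if val > best_val:
            best_k, best_val = k, val
    return best_k, best_val
```

For example, with $S = \{ab, acb\}$, $T = \{accb, ba\}$, and $v = ab$, the threshold values are 2, 3, 4, and $\infty$, and the score $|x/2 - y/2|$ (chosen here only for illustration) is maximized at $k = 3$.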

By substituting this procedure into the algorithm FindMaxSubsequence, we
get an algorithm to find a best episode pattern from two given sets of strings,
according to the function $f$, shown in Fig. 3. We add a method crs(v) to the
data structure StringSet that returns the CRS of $\{x_{(v,k)}\}_{k=0}^{\infty}$ mentioned above.

By Lemmas 1 and 2, we can use the value upperBound computed at $(x_{(v,\infty)}, y_{(v,\infty)})$
in line 20, marked by (*), to prune branches in the search tree. We
emphasize that the value at $(x_{(v,k')}, y_{(v,k')})$ is insufficient as upperBound. Note
also that $x_{(v,\infty)}$ and $y_{(v,\infty)}$ can be extracted from x and y in constant time,
respectively. The next theorem guarantees the completeness of the algorithm.

Theorem 4 ([16]). Let $S$ and $T$ be sets of strings, and $\ell$ be a positive integer.
The algorithm FindBestEpisode(S, T, $\ell$) will return an episode pattern that
maximizes $f(x_{(v,k)}, y_{(v,k)})$, with $x_{(v,k)} = |S \cap L^{eps}((v, k))|$ and
$y_{(v,k)} = |T \cap L^{eps}((v, k))|$, where $v$ ranges over all strings of length at most $\ell$
and $k$ ranges over all integers.
 1 episodePattern FindBestEpisode(StringSet S, T, int ℓ)
 2   string prefix, v;
 3   episodePattern maxEpisode; /* pair of string and int */
 4   double upperBound = ∞, maxVal = -∞, val;
 5   int k';
 6   CompactRepr x, y; /* CRS */
 7   PriorityQueue queue; /* Best First Search */
 8   queue.push("", ∞);
 9   while not queue.empty() do
10     (prefix, upperBound) = queue.pop();
11     if upperBound < maxVal then break;
12     foreach c ∈ Σ do
13       v = prefix + c; /* string concatenation */
14       x = S.crs(v);
15       y = T.crs(v);
16       k' = argmax_k {f(x_(v,k), y_(v,k))} and val = f(x_(v,k'), y_(v,k'));
17       if val > maxVal then
18         maxVal = val;
19         maxEpisode = (v, k');
20 (*)   upperBound = max{f(x_(v,∞), y_(v,∞)), f(x_(v,∞), 0), f(0, y_(v,∞)), f(0, 0)};
21       if upperBound > maxVal and |v| < ℓ then
22         queue.push(v, upperBound);
23   return maxEpisode;



Fig. 3. Algorithm FindBestEpisode.
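A compact, runnable counterpart of Fig. 3 is sketched below in Python. It is an illustration under assumptions, not the implementation of [16]: thresholds are computed naively, the candidate thresholds are evaluated by direct counting rather than through the CRS merge, `heapq` with negated bounds replaces the PriorityQueue, and all function names are hypothetical. The point it illustrates is line 20 of the figure: the pruning bound is taken at $k = \infty$, not at the best threshold $k'$.

```python
import heapq
import math
from typing import Callable, List, Optional, Sequence, Tuple


def threshold(v: str, s: str) -> float:
    """Smallest k such that v is a subsequence of a length-k substring of s."""
    best = math.inf
    for i, c in enumerate(s):
        if c != v[0]:
            continue
        j = i
        for ch in v[1:]:
            j = s.find(ch, j + 1)
            if j < 0:
                break
        else:
            best = min(best, j - i + 1)
    return best


def find_best_episode(
    S: Sequence[str],
    T: Sequence[str],
    alphabet: str,
    f: Callable[[int, int], float],
    max_length: int,
) -> Tuple[Optional[Tuple[str, int]], float]:
    """Best-first search over episode patterns (v, k) with |v| <= max_length."""
    max_val, max_episode = -math.inf, None
    queue: List[Tuple[float, str]] = [(-math.inf, "")]  # (negated bound, prefix)
    while queue:
        neg_bound, prefix = heapq.heappop(queue)
        if -neg_bound < max_val:
            break
        for c in alphabet:
            v = prefix + c
            ts = [threshold(v, s) for s in S]
            tt = [threshold(v, s) for s in T]
            # Only the finite threshold values need to be tried as k.
            for k in sorted({t for t in ts + tt if t != math.inf}):
                val = f(sum(t <= k for t in ts), sum(t <= k for t in tt))
                if val > max_val:
                    max_val, max_episode = val, (v, k)
            # Bound at k = infinity: counts of strings having v as a subsequence.
            x_inf = sum(t != math.inf for t in ts)
            y_inf = sum(t != math.inf for t in tt)
            upper = max(f(x_inf, y_inf), f(x_inf, 0), f(0, y_inf), f(0, 0))
            if upper > max_val and len(v) < max_length:
                heapq.heappush(queue, (-upper, v))
    return max_episode, max_val
```

On $S = \{ab, acb\}$ and $T = \{accb, ba\}$ with patterns of length at most 2, the search returns the episode $(ab, 3)$, which matches both strings of $S$ and none of $T$.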



7 Efficient Data Structures to Count Matched Strings 

In this section, we introduce some efficient data structures to speed up answering
the queries.

First we pay our attention to the following problem. 

Definition 6 (Counting the matched strings).

Input: A finite set $S \subseteq \Sigma^*$ of strings.

Query: A string $seq \in \Sigma^*$.

Answer: The cardinality of the set $S \cap L^{seq}(seq)$.

Of course, the answer to the query should be very fast, since many queries 
will arise. Thus, we should preprocess the input in order to answer the query 
quickly. On the other hand, the preprocessing time is also a critical factor in 
some applications. In this paper, we utilize automata that accept subsequences 
of strings. 

In [17], we considered a subsequence automaton as a deterministic complete 
finite automaton that recognizes all possible subsequences of a set of strings. 
Fig. 4. Subsequence automaton for S = {abab, abb, bb}, where $\Sigma$ = {a, b}. Each number
on a state denotes the number of matched strings. For example, by traversing the states
according to a string ab, we reach the state whose number is 2. It corresponds to the
cardinality $|L^{seq}(ab) \cap S| = 2$, since ab is a subsequence of both abab and abb, but is
not a subsequence of bb.



This automaton is essentially the same as the directed acyclic subsequence graph (DASG)
introduced by Baeza-Yates [5]. We showed an online construction of subsequence
automaton for a set of strings. Our algorithm runs in $O(|\Sigma|(m + k) + N)$ time
using $O(|\Sigma| m)$ space, where $|\Sigma|$ is the size of the alphabet, $N$ is the total length of
strings, and $m$ is the number of states of the resulting subsequence automaton.
We can extend the automaton so that it answers the above Counting the matched
strings problem in a natural way (see Fig. 4).

Although the construction time is linear in the size $m$ of the automaton to be
built, unfortunately $m = O(n^k)$ in general, where we assume that the set $S$
consists of $k$ strings of length $n$. In fact, we proved the lower bound $m = \Omega(n^k)$
for any $k > 0$ [34]. Thus, when the construction time is also a critical factor, as in
our application, it may not be a good idea to construct a subsequence automaton
for the set $S$ itself. Here, for a specified parameter $mode > 0$, we partition the set
$S$ into $d = k/mode$ subsets $S_1, S_2, \ldots, S_d$ of at most $mode$ strings, and construct
$d$ subsequence automata, one for each $S_i$. When asking a query $seq$, we have only
to traverse all automata simultaneously, and return the sum of the answers. In
this way, we can balance the preprocessing time with the total time to answer
(possibly many) queries. In [15], we experimentally evaluated the optimal value
of the parameter $mode$.
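
As a rough illustration of the partitioning idea, the following Python sketch takes the
extreme case mode = 1: it builds one small subsequence automaton (a next-occurrence
table) per string and answers a counting query by traversing all of them and summing
the results. The function names and the table representation are assumptions made
for this sketch; it is not the online construction of [17].

```python
def build_subsequence_automaton(s, alphabet):
    """Return nxt, where nxt[i][c] is the state reached from state i on c.

    State i means "the first i characters of s have been consumed".
    A missing transition is represented by None.
    """
    nxt = [None] * (len(s) + 1)
    last = {c: None for c in alphabet}
    for i in range(len(s), -1, -1):
        nxt[i] = dict(last)          # next occurrence of each character after position i
        if i > 0:
            last[s[i - 1]] = i
    return nxt


def accepts(nxt, query):
    """True if query is a subsequence of the string the automaton was built from."""
    state = 0
    for c in query:
        state = nxt[state].get(c)
        if state is None:
            return False
    return True


def count_matched(automata, query):
    """Sum the answers of all automata (each one covers a single string here)."""
    return sum(1 for nxt in automata if accepts(nxt, query))


# The set S of Fig. 4.
S = ["abab", "abb", "bb"]
automata = [build_subsequence_automaton(s, "ab") for s in S]
```

With larger values of mode, each automaton would cover several strings and store a
count on every state, which trades more preprocessing for fewer traversals per query.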

We now analyze the complexity of episode pattern matching. Given an episode
pattern $\langle v, k\rangle$ and a string $t$, determine whether $t \in L^{eps}(\langle v, k\rangle)$ or not.
This problem can be answered by filling up the edit distance table between $v$ and $t$, where
only the insertion operation with cost one is allowed. It takes $O(mn)$ time and space
using a standard dynamic programming method, where $m = |v|$ and $n = |t|$.
For a fixed string, an automata-based approach is useful. We use the Episode
Directed Acyclic Subsequence Graph (EDASG) for string $t$, which was introduced
by Troníček in [33]. Hereafter, let EDASG($t$) denote the EDASG for $t$. With the
use of EDASG($t$), episode pattern matching can be answered quickly in practice,
although the worst case behavior is still $O(mn)$. EDASG($t$) is also useful to
compute the threshold value $\theta$ of given $v$ for $t$ quickly in practice. As an example,
EDASG(aabaababb) is shown in Fig. 5. When examining if an episode pattern
$\langle abb, 4\rangle$ matches with $t$ or not, we start from the initial state 0 and arrive at
state 6, by traversing the forward edges spelling abb. It means that the shortest
prefix of $t$ that contains abb as a subsequence is $t[0 : 6]$ = aabaab, where $t[i : j]$
denotes the substring $t_{i+1} \cdots t_j$ of $t$. Moreover, the difference between the state
numbers 6 and 0 corresponds to the length of the matched substring aabaab of $t$,
that is, 6 − 0 = |aabaab|. Since it exceeds the threshold 4, we move backwards
spelling bba and reach state 1. It means that the shortest suffix of $t[0 : 6]$ that
contains abb as a subsequence is $t[1 : 6]$ = abaab. Since 6 − 1 > 4, we have to
examine other possibilities. It is not hard to see that we have only to consider
the string $t[2 : *]$. Thus we continue the same traversal started from state 2,
that is the next state of state 1. By forward traversal spelling abb, we reach state
8, and then backward traversal spelling bba brings us to state 4. This time,
we found the matched substring $t[4 : 8]$ = abab which contains the subsequence
abb, and the length 8 − 4 = 4 satisfies the threshold. Therefore we report the
occurrence and terminate the procedure.

Fig. 5. EDASG($t$) for $t$ = aabaababb. Solid arrows denote the forward edges, and broken
arrows denote the backward edges.

It is not difficult to see that the EDASGs are useful to compute the threshold
value of $v$ for a fixed $t$. We have only to repeat the above forward and backward
traversal up to the end, and return the minimum length of the matched substrings.
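
The forward and backward traversal can be imitated without building the graph.
The Python sketch below is only an illustration under that assumption: it uses plain
string searches in place of EDASG edges, and the function names are hypothetical.
It repeats the forward/backward step up to the end of $t$ and returns the minimum
window length, which is then compared with the threshold $k$.

```python
def threshold_value(v, t):
    """Length of the shortest substring of t containing v as a subsequence.

    Returns None if v is not a subsequence of t.
    """
    if not v:
        return 0
    best = None
    start = 0
    while True:
        # Forward traversal: shortest prefix of t[start:] that contains v.
        pos = start
        for c in v:
            j = t.find(c, pos)
            if j < 0:
                return best
            pos = j + 1
        end = pos
        # Backward traversal: shortest suffix of t[start:end] that contains v.
        for c in reversed(v):
            pos = t.rfind(c, 0, pos)
        begin = pos
        if best is None or end - begin < best:
            best = end - begin
        # Any better window must start after the current one.
        start = begin + 1


def episode_matches(v, k, t):
    """True if some substring of t of length at most k contains v as a subsequence."""
    length = threshold_value(v, t)
    return length is not None and length <= k
```

On the running example, the first two windows found are $t[1 : 6]$ and $t[4 : 8]$, as in
the text; unlike the matching procedure described above, this sketch does not stop at
the first window that satisfies the threshold.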

8 Concluding Remarks 

In this paper, we focused on the pattern discovery problem for substring,
subsequence, and episode patterns to illustrate the basic ideas. We have already
generalized it in various ways: considering variable-length-don’t care patterns [19]
and their variation [32], finding correlated patterns from a given set of strings
with numeric attribute values [8], and finding a boolean combination of patterns
instead of a single pattern [6,18]. We have also shown another data structure to support
fast counting of matched strings [21,20], and an application to extending the BONSAI
system [7].
Abstract. We present a survey of recent results concerning the theoretical
and empirical performance of algorithms for learning regularized
least-squares classifiers. The behavior of this family of learning algorithms
is analyzed in both the statistical and the worst-case (individual
sequence) data-generating models.



1 Regularized Least-Squares for Classification 

In the pattern classification problem, some unknown source is supposed to generate
a sequence $x_1, x_2, \ldots$ of instances (data elements) $x_t \in \mathcal{X}$, where $\mathcal{X}$ is
usually taken to be $\mathbb{R}^d$ for some fixed $d$. Each instance $x_t$ is associated with a
class label $y_t \in \mathcal{Y}$, where $\mathcal{Y}$ is a finite set of classes, indicating a certain semantic
property of the instance. For instance, in a handwritten digit recognition task, $x_t$
is the digitalized image of a handwritten digit and its label $y_t \in \{0, 1, \ldots, 9\}$ is
the corresponding numeral. A learning algorithm for pattern classification uses a
set of training examples, that is pairs $(x_t, y_t)$, to build a classifier $f : \mathcal{X} \to \mathcal{Y}$ that
predicts, as accurately as possible, the labels of any further instance generated
by the source. As training data are usually labelled by hand, the performance
of a learning algorithm is measured in terms of its ability to trade off predictive
power against the amount of training data.

Due to the recent success of kernel methods (see, e.g., [8]), linear-threshold
(L-T) classifiers of the form $f(x) = \mathrm{SGN}(w^\top x)$, where $w \in \mathbb{R}^d$ is the classifier
parameter and $\mathrm{SGN}(\cdot) \in \{-1, +1\}$ is the signum function, have become one of
the most popular approaches for solving binary classification problems (where
the label set is $\mathcal{Y} = \{-1, +1\}$). In this paper, we focus on a specific family
of algorithms for learning L-T classifiers based on the solution of a regularized
least-squares problem.

The basic algorithm within this family is the second-order Perceptron algorithm [2].
This algorithm works incrementally: each time a new training example
$(x_t, y_t)$ is obtained, a prediction $\hat{y}_t = \mathrm{SGN}(w_t^\top x_t)$ for the label $y_t$ of $x_t$ is
computed, where $w_t$ is the parameter of the current L-T classifier. If $\hat{y}_t \neq y_t$, then
$w_t$ is updated and a new L-T classifier, with parameter $w_{t+1}$, is generated. The
second-order Perceptron starts with the constant L-T classifier $w_1 = (0, \ldots, 0)$
and, at time step $t$, computes $w_{t+1} = (aI + S_t S_t^\top)^{-1} S_t \boldsymbol{y}_t$, where $a > 0$ is a
parameter, $I$ is the identity matrix, and $S_t$ is the matrix whose columns are the
previously stored instances $x_s$; that is, those instances $x_s$ ($1 \le s \le t$) such that
$\mathrm{SGN}(w_s^\top x_s) \neq y_s$. Finally, $\boldsymbol{y}_t$ is the vector of labels of the stored instances. Note
that, given $S_{t-1}$, $\boldsymbol{y}_{t-1}$, and $(x_t, y_t)$, the parameter $w_{t+1}$ can be computed in
time $O(d^2)$. Working in dual variables (as when kernels are used), the update
time becomes quadratic in the number of stored instances. The connection with
regularized least-squares is revealed by the identity

$$w_{t+1} = \operatorname*{arginf}_{v \in \mathbb{R}^d} \left( \sum_{s \in \mathcal{M}_t} \left(v^\top x_s - y_s\right)^2 + a\,\|v\|^2 \right)$$

where $\mathcal{M}_t$ is the set of indices $s$ of stored instances $x_s$ after the first $t$ steps.
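
A minimal Python sketch of the update described above may help fix ideas. It keeps
the matrix $aI + S_t S_t^\top$ and the vector $S_t \boldsymbol{y}_t$ explicitly and re-solves the linear
system after every mistake, which costs more than the $O(d^2)$ incremental update
mentioned in the text. The class and function names, the convention SGN(0) = +1,
and the repeated passes over the data in `train` are assumptions of this sketch.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def sgn(value):
    return 1 if value >= 0 else -1


class SecondOrderPerceptron:
    def __init__(self, d, a=1.0):
        self.d = d
        self.A = [[a if i == j else 0.0 for j in range(d)] for i in range(d)]  # aI + S S^T
        self.b = [0.0] * d                                                      # S y
        self.w = [0.0] * d                                                      # w_1 = 0
        self.updates = 0

    def predict(self, x):
        return sgn(sum(wi * xi for wi, xi in zip(self.w, x)))

    def observe(self, x, y):
        """Predict the label of x, then update if the prediction was wrong."""
        y_hat = self.predict(x)
        if y_hat != y:
            for i in range(self.d):
                for j in range(self.d):
                    self.A[i][j] += x[i] * x[j]   # store x as a new column of S
                self.b[i] += y * x[i]
            self.w = solve(self.A, self.b)        # w = (aI + S S^T)^{-1} S y
            self.updates += 1
        return y_hat


def train(model, examples, max_epochs=100):
    """Cycle through the examples until a full pass makes no mistake."""
    for _ in range(max_epochs):
        mistakes = sum(1 for x, y in examples if model.observe(x, y) != y)
        if mistakes == 0:
            break
    return model
```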

Similarly to the classical convergence theorems for the Perceptron algorithm
(see [1,6,7]), in [2] a theorem is proven bounding the number of updates performed
by the second-order Perceptron on an arbitrary sequence of training
examples. In the simplest case where the training sequence is linearly separable
by some (unknown) unit-norm vector $u$ with margin $\gamma$, the second-order
Perceptron converges to a linear separator after at most

$$\frac{1}{\gamma} \sqrt{\left(a + u^\top S S^\top u\right) \sum_{i=1}^{d} \ln\left(1 + \lambda_i / a\right)}$$

many updates, where $S$ is the matrix of stored instances after convergence is attained
and $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of $S S^\top$. The quadratic form $u^\top S S^\top u$
takes value in the interval $[\lambda_{\min}, \lambda_{\max}]$. This bound is often better than the
corresponding bound $\sqrt{\lambda_1 + \cdots + \lambda_d}/\gamma$ for the Perceptron algorithm, where the
dependence on the eigenvalues is linear rather than logarithmic. This advantage
appears to be real as the second-order algorithm is observed to converge faster
than the standard Perceptron on real-world datasets.



2 Application to Online Learning 

Incremental algorithms, such as the second-order Perceptron, are natural tools
in on-line learning applications (e.g., the real-time categorization of stories
provided by a newsfeed). In these scenarios a stream of instances is fed into the
learning algorithm which uses its current L-T classifier to predict their labels.
Occasionally, the algorithm can query the label $y_t$ of a selected instance $x_t$ in
order to form a training instance $(x_t, y_t)$ which can be immediately used to
improve its predictive performance. Hence, unlike the traditional learning setup,
here the algorithm can selectively choose the instances to be used for training,
but it has to do so while simultaneously minimizing the number of prediction
mistakes over the entire sequence of instances, including those instances whose
label has not been queried. The second-order Perceptron can be directly applied
in this scenario by simply querying the label of an instance $x_t$ with probability
roughly proportional to $1/(1 + |w_t^\top x_t|/C)$, so that the labels of instances on which
the current L-T classifier achieves a large margin $|w_t^\top x_t|$ have a small probability
of being queried. (The empirical behaviour of this algorithm for different values
of the parameter $C$ is reported in Figure 1.) After this randomized step, if the
label $y_t$ is actually queried and $\mathrm{SGN}(w_t^\top x_t) \neq y_t$, then the second-order update
described earlier is performed. Surprisingly, the expected number of mistakes
made by this randomized variant of the second-order Perceptron algorithm can
be bounded by the same quantity that bounds the number of updates made (on
the same sequence) by the standard deterministic second-order Perceptron using
each observed instance for its training [5].

Fig. 1. Behaviour of the randomized second-order Perceptron for on-line learning on
a newsstory categorization task based on the Reuters Corpus Volume 1. For each value
of the parameter $C$ a pair of curves is plotted as a function of the number of instances
observed. The increasing curve is the average F-measure (a kind of accuracy) and the
decreasing curve is the rate of labels queried. Note that, for the range of parameter
values displayed, the performance does not get significantly worse as the parameter
value decreases causing the label rate to drop faster. In all cases, the performance
remains close to that of the algorithm that queries all labels (corresponding to the
choice $C \to \infty$).
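
The randomized querying step of this section can be sketched in a few lines of Python.
The sketch assumes that the query probability is taken to be exactly
$1/(1 + |w_t^\top x_t|/C)$ (the text only says "roughly proportional"); the function
names are hypothetical.

```python
import random


def query_probability(margin, C):
    """Probability of querying a label, given the margin w_t^T x_t and C > 0."""
    return 1.0 / (1.0 + abs(margin) / C)


def should_query(margin, C, rng=random):
    """Draw the randomized decision: True means the label is queried."""
    return rng.random() < query_probability(margin, C)
```

A small margin gives a probability close to 1, a large margin a probability close to 0,
and letting $C$ grow makes the rule query almost every label.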

3 The Statistical Learning Model 

Unlike the previous analyses, where no assumption is made on the way the labelled
instances are generated, in statistical pattern classification instances $x_t$
are realizations of i.i.d. draws from some unknown probability distribution on $\mathcal{X}$,
and then binary labels $y_t \in \{-1, +1\}$ are random variables correlated to $x_t$ via
some unknown noise model specifying the conditional probability $\mathbb{P}(y_t = 1 \mid x_t)$.
Hence, in this model a set $S$ of training examples is a statistical sample and
learning becomes equivalent to a problem of statistical inference. The goal in
statistical pattern classification is to bound, with high probability with respect
to the random draw of the training sample $S$, the probability that the classifier,
generated by the learning algorithm run on $S$, makes a mistake on a further
random example drawn from the same i.i.d. source. This probability is usually
called statistical risk. Building on the second-order Perceptron update bounds
described above, holding for arbitrary sequences of data, it is possible to show
that the risk of a L-T classifier generated by the second-order Perceptron is
bounded by

$$\frac{M_n}{n} + c \sqrt{\frac{1}{n} \ln \frac{n}{\delta}}$$

with probability at least $1 - \delta$ with respect to the random draw of a training
sample of size $n$. Here $c > 0$ is a universal constant and $M_n$ is the number of
updates performed by the second-order Perceptron while being run once on the
training sample. This bound holds for any arbitrary probability distribution on
$\mathcal{X}$ and any arbitrary noise model.

Fig. 2. Behaviour of the deterministic second-order Perceptron for on-line learning
(called selective sampling in the legend) on the same newsstory categorization task
of Figure 1. As in Figure 1, the increasing curve is the average F-measure and the
decreasing curve is the rate of labels queried. Note that the label rate of the selective
sampling algorithm decreases very fast and the performance of the algorithm eventually
reaches the performance of the second-order Perceptron which is allowed to observe all
labels.

Within the statistical learning model, the randomized step in the on-line
learning scenario (where the algorithm queries the labels) is not needed anymore.
Indeed, by assuming a specific noise model, different label-querying strategies
for the second-order Perceptron can be devised that improve on the empirical
performance of the randomized rule. In particular, assume for simplicity that
instances are arbitrarily distributed on the unit sphere in $\mathbb{R}^d$ and assume some
unit-norm vector $u \in \mathbb{R}^d$ exists such that $\mathbb{P}(y_t = 1 \mid x_t) = (1 + u^\top x_t)/2$. Then,
it can be shown that the L-T classifier computed by the second-order Perceptron
is a statistical estimator of $u$ (which is also the parameter of the Bayes optimal
classifier for this model) [3]. This can be used to devise a querying strategy based
on the computation of the confidence interval for the estimator. In practice, the
randomized querying rule introduced earlier is replaced by a deterministic rule
prescribing to query the label of $x_t$ whenever $|w_t^\top x_t| \le c\sqrt{(\ln t)/N_t}$, where $c$ is
some constant (the choice $c = 5$ works well in practice) and $N_t$ is the number of
labels queried up to time $t$. The empirical behavior of this deterministic version
of the second-order Perceptron for on-line applications is depicted in Figure 2.
Theoretical bounds for this specific algorithm proven in [3] show that, under
additional conditions on the distribution according to which instances are drawn,
the rate of queried labels decreases exponentially in the number of observed
instances, while the excess rate of mistakes with respect to the Bayes optimal
classifier grows only logarithmically in the same quantity.
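
The deterministic querying rule can be written down directly. The Python sketch
below is an illustration only; the function name is hypothetical, and the choice to
always query while no label has been queried yet ($N_t = 0$) is an assumption made
here to avoid a division by zero.

```python
import math


def should_query_deterministic(margin, t, num_queried, c=5.0):
    """Query the label at time t when |margin| <= c * sqrt(ln(t) / N_t)."""
    if num_queried == 0:
        return True  # assumption: nothing has been queried yet, so ask for the label
    return abs(margin) <= c * math.sqrt(math.log(t) / num_queried)
```

As $N_t$ grows, the threshold shrinks, so labels are requested only for instances on
which the current classifier has a small margin.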

The fact that, under the linear noise model described above, the L-T classifier
learned by the second-order Perceptron estimates the parameter of the model
can also be used to apply the second-order algorithm to the task of hierarchical
classification, a variant of the classification task in which the labels given to
instances are unions of paths taken from a fixed taxonomy imposed on the set $\mathcal{Y}$
of class labels. Some preliminary results in this direction are shown in [4].
Abstract. Probabilistic inductive logic programming, sometimes also
called statistical relational learning, addresses one of the central ques- 
tions of artificial intelligence: the integration of probabilistic reasoning 
with first order logic representations and machine learning. A rich variety 
of different formalisms and learning techniques have been developed. In 
the present paper, we start from inductive logic programming and sketch 
how it can be extended with probabilistic methods. 

More precisely, we outline three classical settings for inductive logic pro- 
gramming, namely learning from entailment, learning from interpreta- 
tions, and learning from proofs or traces, and show how they can be 
used to learn different types of probabilistic representations. 



1 Introduction 

In the past few years there has been a lot of work lying at the intersection of prob- 
ability theory, logic programming and machine learning [40,15,42,31,35,18,25,21, 
2,24]. This work is known under the names of statistical relational learning [14, 
11], probabilistic logic learning [9], or probabilistic inductive logic programming. 
Whereas most of the existing works have started from a probabilistic learning 
perspective and extended probabilistic formalisms with relational aspects, we 
will take a different perspective, in which we will start from inductive logic pro- 
gramming and study how inductive logic programming formalisms, settings and 
techniques can be extended to deal with probabilistic issues. This tradition has 
already contributed a rich variety of valuable formalisms and techniques, includ- 
ing probabilistic Horn abduction by David Poole, PRISMs by Sato, stochastic 
logic programs by Eisele [12], Muggleton [31] and Cussens [4], Bayesian logic 
programs [22,20] by Kersting and De Raedt, and Logical Hidden Markov Mod- 
els [24]. 

The main contribution of this paper is the introduction of three probabilistic 
inductive logic programming settings which are derived from the learning from 
entailment, from interpretations and from proofs settings of the field of induc- 
tive logic programming [6]. Each of these settings contributes different notions 
of probabilistic logic representations, examples and probability distributions. 
The first setting, probabilistic learning from entailment, combines key princi- 
ples of the well-known inductive logic programming system FOIL [41] with the 
naive Bayes’ assumption; the second setting, probabilistic learning from
interpretations, incorporated in Bayesian logic programs [22,20], integrates Bayesian
networks with logic programming; and the third setting, learning from proofs, 
incorporated in stochastic logic programs [12,31,4], upgrades stochastic context 
free grammars to logic programs. The sketched settings (and their instances 
presented) are by no means the only possible settings for probabilistic induc- 
tive logic programming. Nevertheless, two of the settings have - to the authors’ 
knowledge - not been introduced before. Even though it is not our aim to pro- 
vide a complete survey on probabilistic inductive logic programming (for such 
a survey, see [9]), we hope that the settings will contribute to a better under- 
standing of probabilistic extensions to inductive logic programming and will also 
clarify some of the logical issues about probabilistic learning. 

This paper is structured as follows: in Section 2, we present the three induc- 
tive logic programming settings, in Section 3, we extend these in a probabilistic 
framework, in Section 4, we discuss how to learn probabilistic logics in these 
three settings, and finally, in Section 5, we conclude. The Appendix contains a 
short introduction to some logic programming concepts and terminology that 
are used in this paper. 



2 Inductive Logic Programming Settings 

Inductive logic programming is concerned with finding a hypothesis H (a logic 
program, i.e. a definite clause program) from a set of positive and negative ex- 
amples P and N. More specifically, it is required that the hypothesis H covers 
all positive examples in P and none of the negative examples in N. The rep- 
resentation language chosen for representing the examples together with the 
covers relation determines the inductive logic programming setting [6]. Various 
settings have been considered in the literature, most notably learning from en- 
tailment [39] and learning from interpretations [8,17], which we formalize below. 
We also introduce an intermediate setting inspired by the seminal work by Ehud
Shapiro [43], which we call learning from proofs.

Before formalizing these settings, let us however also discuss how background
knowledge is employed within inductive logic programming. For the purposes of
the present paper¹, it will be convenient to view the background knowledge B
as a logic program (i.e. a definite clause program) that is provided to the
inductive logic programming system and fixed during the learning process. In the
presence of background knowledge, the hypothesis H together with the background
theory B should cover all positive and none of the negative examples.
The ability to provide declarative background knowledge to the learning engine
is viewed as one of the strengths of inductive logic programming.



¹ For the learning from interpretations setting, we slightly deviate from the standard
definition in the literature for didactic purposes.
2.1 Learning from Entailment

Learning from entailment is by far the most popular inductive logic programming
setting and it is addressed by a wide variety of well-known inductive logic
programming systems such as FOIL [41], Progol [30], and Aleph [44].

Definition 1. When learning from entailment, the examples are definite clauses
and a hypothesis $H$ covers an example $e$ w.r.t. the background theory $B$ if and
only if $B \cup H \models e$.

In many well-known systems, such as FOIL, one requires that the examples
are ground facts, a special case of definite clauses. To illustrate the above
setting, consider the following example inspired by the well-known mutagenicity
application [45].

Example 1. Consider the following facts in the background theory B. They de- 
scribe part of molecule 225. 

molecule(225).
logmutag(225,0.64).
lumo(225,-1.785).
logp(225,1.01).
nitro(225,[f1_4,f1_8,f1_9]).
atom(225,f1_1,c,21,0.187).
atom(225,f1_2,c,21,-0.143).
atom(225,f1_3,c,21,-0.143).
atom(225,f1_4,c,21,-0.013).
atom(225,f1_5,o,52,-0.043).
ring_size_5(225,[f1_5,f1_1,f1_2,f1_3,f1_4]).
hetero_aromatic_5_ring(225,[f1_5,f1_1,f1_2,f1_3,f1_4]).
bond(225,f1_1,f1_2,7).
bond(225,f1_2,f1_3,7).
bond(225,f1_3,f1_4,7).
bond(225,f1_4,f1_5,7).
bond(225,f1_5,f1_1,7).
bond(225,f1_8,f1_9,2).
bond(225,f1_8,f1_10,2).
bond(225,f1_1,f1_11,1).
bond(225,f1_11,f1_12,2).
bond(225,f1_11,f1_13,1).



Consider now the example mutagenic(225). It is covered by the following clause:

mutagenic(M) :- nitro(M,R1), logp(M,C), C > 1.

Inductive logic programming systems that learn from entailment often employ
a typical separate-and-conquer rule-learning strategy [13]. In an outer loop
of the algorithm, they follow a set-covering approach [29] in which they repeatedly
search for a rule covering many positive examples and none of the negative
examples. They then delete the positive examples covered by the current clause
and repeat this process until all positive examples have been covered. In the
inner loop of the algorithm, they typically perform a general-to-specific heuristic
search employing a refinement operator under $\theta$-subsumption [39].
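
The outer set-covering loop can be sketched as follows in Python. This is a strongly
simplified illustration: the inner general-to-specific search is replaced by picking
from a fixed pool of candidate rules, each given as a Python predicate, and examples
are plain tuples (molecule id, has nitro group, logp). Apart from molecule 225 and its
logp value, all example data and rule names are made up for the sketch.

```python
def separate_and_conquer(positives, negatives, candidate_rules):
    """Greedy set covering over a pool of (name, predicate) candidate rules.

    Returns the list of selected rule names and the positives left uncovered.
    """
    remaining = set(positives)
    theory = []
    while remaining:
        best_name, best_covered = None, set()
        for name, covers in candidate_rules:
            if any(covers(e) for e in negatives):
                continue  # a rule must cover none of the negative examples
            covered = {e for e in remaining if covers(e)}
            if len(covered) > len(best_covered):
                best_name, best_covered = name, covered
        if best_name is None:
            break  # no consistent rule covers any remaining positive example
        theory.append(best_name)
        remaining -= best_covered  # "separate": drop the covered positives
    return theory, remaining


# Hypothetical examples: (molecule id, has nitro group, logp).
positives = [(225, True, 1.01), (226, False, 2.3), (227, True, 3.0)]
negatives = [(300, True, 0.5), (301, False, 1.5)]
candidate_rules = [
    ("nitro, logp > 1", lambda e: e[1] and e[2] > 1),
    ("logp > 2", lambda e: e[2] > 2),
    ("nitro", lambda e: e[1]),
]
theory, uncovered = separate_and_conquer(positives, negatives, candidate_rules)
```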

2.2 Learning from Interpretations 

The learning from interpretations setting [8] was inspired by the work on
boolean concept-learning in computational learning theory [47].

Definition 2. When learning from interpretations, the examples are Herbrand
interpretations and a hypothesis $H$ covers an example $e$ w.r.t. the background
theory $B$ if and only if $e$ is a model of $B \cup H$.

Herbrand interpretations are sets of true ground facts and they completely
describe a possible situation.



Example 2. Consider the interpretation $I$ which is the union of

B = {father(henry,bill),   father(alan,betsy),    father(alan,benny),
     father(brian,bonnie), father(bill,carl),     father(benny,cecily),
     father(carl,dennis),  mother(ann,bill),      mother(ann,betsy),
     mother(ann,bonnie),   mother(alice,benny),   mother(betsy,carl),
     mother(bonnie,cecily), mother(cecily,dennis), founder(henry),
     founder(alan), founder(ann), founder(brian), founder(alice)}

and C = {carrier(alan), carrier(ann), carrier(betsy)}. It is covered by the
clause $c$

carrier(X) :- mother(M,X), carrier(M), father(F,X), carrier(F).

This clause covers the interpretation $I$ because for all substitutions $\theta$ such that
$body(c)\theta \subseteq I$ holds, it also holds that $head(c)\theta \in I$.
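
The covers test of Example 2 can be checked mechanically. The Python sketch below
enumerates all substitutions that make the body true in $I$ and verifies that the head
is then in $I$ as well. Atoms are encoded as tuples, variables as capitalised strings;
this encoding and the function names are assumptions of the sketch, and it only
handles function-free clauses whose head variables all occur in the body.

```python
def is_var(term):
    return term[:1].isupper()


def match(atom, fact, theta):
    """Extend theta so that atom becomes the ground fact; return None on failure."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    theta = dict(theta)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if theta.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return theta


def body_substitutions(body, interpretation, theta=None):
    """Yield every substitution theta with body(c)theta contained in the interpretation."""
    theta = {} if theta is None else theta
    if not body:
        yield theta
        return
    for fact in interpretation:
        extended = match(body[0], fact, theta)
        if extended is not None:
            yield from body_substitutions(body[1:], interpretation, extended)


def is_model_of_clause(interpretation, head, body):
    """True if head(c)theta is in the interpretation whenever body(c)theta is."""
    def apply(atom, theta):
        return (atom[0],) + tuple(theta.get(t, t) for t in atom[1:])
    return all(apply(head, theta) in interpretation
               for theta in body_substitutions(body, interpretation))


I = {
    ("father", "henry", "bill"), ("father", "alan", "betsy"), ("father", "alan", "benny"),
    ("father", "brian", "bonnie"), ("father", "bill", "carl"), ("father", "benny", "cecily"),
    ("father", "carl", "dennis"), ("mother", "ann", "bill"), ("mother", "ann", "betsy"),
    ("mother", "ann", "bonnie"), ("mother", "alice", "benny"), ("mother", "betsy", "carl"),
    ("mother", "bonnie", "cecily"), ("mother", "cecily", "dennis"),
    ("founder", "henry"), ("founder", "alan"), ("founder", "ann"),
    ("founder", "brian"), ("founder", "alice"),
    ("carrier", "alan"), ("carrier", "ann"), ("carrier", "betsy"),
}
head = ("carrier", "X")
body = [("mother", "M", "X"), ("carrier", "M"), ("father", "F", "X"), ("carrier", "F")]
```

Removing carrier(betsy) from $I$ makes the test fail, since both parents of betsy are
carriers.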

The key difference between learning from interpretations and learning from 
entailment is that interpretations carry much more - even complete - informa- 
tion. Indeed, when learning from entailment, an example can consist of a single 
fact, whereas when learning from interpretations, all facts that hold in the ex- 
ample are known. Therefore, learning from interpretations is typically easier and 
computationally more tractable than learning from entailment, cf. [6]. 

Systems that learn from interpretations work in a similar fashion as those that
learn from entailment. There is however one crucial difference and it concerns
the generality relationship. A hypothesis $G$ is more general than a hypothesis
$S$ if all examples covered by $S$ are also covered by $G$. When learning from
entailment, $G$ is more general than $S$ if and only if $G \models S$, whereas when
learning from interpretations, this holds when $S \models G$. Another difference is that learning
from interpretations is well suited for learning from positive examples only. For
this case, a complete search of the space ordered by $\theta$-subsumption is performed
until all clauses cover all examples [7].



2.3 Learning from Proofs 

Because learning from entailment (with ground facts as examples) and interpre- 
tations occupy extreme positions w.r.t. the information the examples carry, it is 
interesting to investigate intermediate positions. Ehud Shapiro’s Model Inference 
System (MIS) [43] fits nicely within the learning from entailment setting where 
examples are facts. However, to deal with missing information, Shapiro employs 
a clever strategy: MIS queries the user for missing information by asking her for
the truth-value of facts. The answers to these queries allow MIS to reconstruct
the trace or the proof of the positive examples. Inspired by Shapiro, we define 
the learning from proofs setting. 

Definition 3. When learning from proofs, the examples are ground proof-trees
and an example $e$ is covered by a hypothesis $H$ w.r.t. the background theory $B$ if
and only if $e$ is a proof-tree for $H \cup B$.

At this point, there exist various possible forms of proof-trees. In this paper, we
will - for reasons that will become clear later - assume that the proof-tree is
given in the form of an and-tree where the nodes contain ground atoms. More
formally:

Definition 4. $t$ is a proof-tree for $T$ if and only if $t$ is a rooted tree where
every node $n \in t$ with children $child(n)$ satisfies the property that there exists a
substitution $\theta$ and a clause $c \in T$ such that $n = head(c)\theta$ and $child(n) = body(c)\theta$.

Example 3. Consider the following definite clause grammar. 

sentence(A, B) :- noun_phrase(C, A, D), verb_phrase(C, D, B).
noun_phrase(A, B, C) :- article(A, B, D), noun(A, D, C).
verb_phrase(A, B, C) :- intransitive_verb(A, B, C).
article(singular, A, B) :- terminal(A, a, B).
article(singular, A, B) :- terminal(A, the, B).
article(plural, A, B) :- terminal(A, the, B).
noun(singular, A, B) :- terminal(A, turtle, B).
noun(plural, A, B) :- terminal(A, turtles, B).
intransitive_verb(singular, A, B) :- terminal(A, sleeps, B).
intransitive_verb(plural, A, B) :- terminal(A, sleep, B).
terminal([A|B], A, B).

It covers the following proof tree $u$ (where atoms are abbreviated accordingly):

[Figure: proof tree $u$]




Proof-trees contain - as interpretations - a lot of information. Indeed, they 
contain instances of the clauses that were used in the proofs. Therefore, it may be 
hard for the user to provide this type of examples. Even though that is generally 
true, there exist specific situations for which this is feasible. Indeed, consider tree 
banks such as the UPenn Wall Street Journal corpus [27], which contain parse 
trees. These trees directly correspond to the proof-trees we talk about. Even
though - to the best of the authors’ knowledge (but see [43,3] for inductive logic
programming systems that learn from traces) - no inductive logic programming
system has been developed to learn from proof-trees, it is not hard to imagine
an outline for such an algorithm. Indeed, by analogy with the learning of
tree-bank grammars, one could turn all the proof-trees (corresponding to positive
examples) into a set of ground clauses which would constitute the initial theory.
This theory can then be generalized by taking the least general generalization
(under $\theta$-subsumption) of pairs of clauses. Of course, care must be taken that
the generalized theory does not cover negative examples.

3 Probabilistic Inductive Logic Programming Settings 

Given the interest in probabilistic logic programs as a representation formalism 
and the different learning settings for inductive logic programming, the question 
arises as to whether one can extend the inductive logic programming settings 
with probabilistic representations. 

When working with probabilistic logic programming representations, there 
are essentially two changes: 

1. clauses are annotated with probability values, and 

2. the covers relation becomes a probabilistic one. 

We will use $\mathbf{P}$ to denote a probability distribution, e.g. $\mathbf{P}(x)$, and the normal
letter $P$ to denote a probability value, e.g. $P(x = v)$, where $v$ is a state of $x$.
Thus,

a probabilistic covers relation takes as arguments an example $e$, a hypothesis
$H$ and possibly the background theory $B$. It then returns a probability
value between 0 and 1. So, $covers(e, H \cup B) = P(e \mid H, B)$, the likelihood
of the example $e$.

The task of probabilistic inductive logic programming is then to find the
hypothesis $H^*$ that maximizes the likelihood of the data $P(E \mid H^*, B)$, where $E$
denotes the set of examples. Under the usual i.i.d. assumption this results in the
maximization of $P(E \mid H^*, B) = \prod_{e \in E} P(e \mid H^*, B)$.

The key contribution of this paper is that we present three probabilistic 
inductive logic programming settings that extend the traditional ones sketched 
above. 



3.1 Probabilistic Entailment 

In order to integrate probabilities in the entailment setting, we need to find a
way to assign probabilities to clauses that are entailed by an annotated logic
program. Since most inductive logic programming systems working under
entailment employ ground facts for a single predicate as examples and the authors
are unaware of any existing probabilistic logic programming formalisms that
implement a probabilistic covers relation for definite clauses in general, we will
restrict our attention in this section to assigning probabilities to facts for a single
predicate. Furthermore, our proposal in this section proceeds along the lines of
the naive Bayes’ framework and represents only one possible choice for
probabilistic entailment. It remains an open question as to how to formulate more general
frameworks for working with entailment (one alternative setting is also presented
in Section 3.3).

More formally, let us annotate a logic program H consisting of a set of clauses
of the form p ← b_i, where p is an atom of the form p(V_1, ..., V_n) with the V_i
different variables, and the b_i are different bodies of clauses. Furthermore, we
associate to each clause in H the probability values P(b_i | p); they constitute the
conditional probability distribution that for a random substitution θ for which
pθ is ground and true (resp. false), the query b_iθ succeeds (resp. fails) in the
knowledge base B.¹ Furthermore, we assume the prior probability of p is given
as P(p); it denotes the probability that for a random substitution θ, pθ is true
(resp. false). This can then be used to define the covers relation P(pθ | H, B) as
follows (we delete the B as it is fixed):



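$$
P(p\theta \mid H) = P(p\theta \mid b_1\theta, \ldots, b_k\theta)
= \frac{P(b_1\theta, \ldots, b_k\theta \mid p\theta) \times P(p\theta)}{P(b_1\theta, \ldots, b_k\theta)} \tag{1}
$$

Applying the naive Bayes assumption yields

$$
P(p\theta \mid H) = \frac{\prod_i P(b_i\theta \mid p\theta) \times P(p\theta)}{P(b_1\theta, \ldots, b_k\theta)} \tag{2}
$$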



Finally, since P(pθ | H) + P(¬pθ | H) = 1, we can compute P(pθ | H)
without P(b_1θ, ..., b_kθ) through normalization.



Example 4. Consider again the mutagenicity domain and the following
annotated logic program:

(0.01, 0.21) : mutagenetic(M) ← atom(M, _, 8, _)
(0.38, 0.99) : mutagenetic(M) ← bond(M, A, 1), atom(M, A, c, 22, _), bond(M, A, 2)

where we denote the first clause by b_1 and the second one by b_2, and the vectors
on the left-hand side of the clauses specify P(b_iθ = true | pθ = true) and
P(b_iθ = true | pθ = false). The covers relation assigns probability 0.97 to
example 225 because both features fail for θ = {M ← 225}. Hence,

$$
\begin{aligned}
P(\mathit{mutagenetic}(225) = \mathit{true},\ b_1\theta = \mathit{false},\ b_2\theta = \mathit{false})
&= P(b_1\theta = \mathit{false} \mid \mathit{muta}(225) = \mathit{true}) \\
&\quad \cdot P(b_2\theta = \mathit{false} \mid \mathit{muta}(225) = \mathit{true}) \\
&\quad \cdot P(\mathit{mutagenetic}(225) = \mathit{true}) \\
&= 0.99 \cdot 0.62 \cdot 0.31 = 0.19
\end{aligned}
$$

and P(mutagenetic(225) = false, b_1θ = false, b_2θ = false) = 0.79 · 0.01 · 0.68 = 0.005.
This yields

$$
P(\mathit{muta}(225) = \mathit{true} \mid b_1\theta = \mathit{false},\ b_2\theta = \mathit{false})
= \frac{0.19}{0.19 + 0.005} = 0.97 .
$$

¹ The query q succeeds in B if there is a substitution σ such that B ⊨ qσ.
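The normalization step of Example 4 can be retraced numerically. The following Python sketch is only an illustration under stated assumptions: the function name `naive_bayes_cover` and its argument names are hypothetical, and the two class weights (0.31 and 0.68) are simply the values quoted in the example.

```python
def naive_bayes_cover(cond_true, cond_false, weight_true, weight_false, outcomes):
    """Return P(p = true | feature outcomes) under the naive Bayes assumption.

    cond_true[i]  = P(b_i succeeds | p is true)
    cond_false[i] = P(b_i succeeds | p is false)
    outcomes[i]   = True if the body b_i succeeded for the example, else False
    """
    joint_true, joint_false = weight_true, weight_false
    for p_t, p_f, succeeded in zip(cond_true, cond_false, outcomes):
        joint_true *= p_t if succeeded else 1.0 - p_t
        joint_false *= p_f if succeeded else 1.0 - p_f
    # Normalizing removes the need for P(b_1, ..., b_k).
    return joint_true / (joint_true + joint_false)


# Values quoted in Example 4; both features fail for example 225.
posterior = naive_bayes_cover(
    cond_true=[0.01, 0.38],
    cond_false=[0.21, 0.99],
    weight_true=0.31,
    weight_false=0.68,
    outcomes=[False, False],
)
print(round(posterior, 2))  # 0.97
```

The denominator of Equation (2) never has to be computed explicitly, which is the point of the normalization remark above.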



3.2 Probabilistic Interpretations 

In order to integrate probabilities in the learning from interpretation setting, 
we need to find a way to assign probabilities to interpretations covered by an 
annotated logic program. In the past few years, this question has received a lot 
of attention and various different approaches have been developed such as [38]. 
In this paper, we choose Bayesian logic programs [21] as the probabilistic logic 
programming system because Bayesian logic programs combine Bayesian net- 
works [37], which represent probability distributions over propositional interpre- 
tations, with definite clause logic. Furthermore, Bayesian logic programs have 
already been employed for learning. 

The idea underlying Bayesian logic programs is to view ground atoms as 
random variables that are defined by the underlying definite clause programs. 
Furthermore, two types of predicates are distinguished: deterministic and prob- 
abilistic ones. The former are called logical, the latter Bayesian. Likewise we 
will also speak of Bayesian and logical atoms. A Bayesian logic program now
consists of a set of Bayesian (definite) clauses, which are expressions of the
form A | A_1, ..., A_n where A is a Bayesian atom, A_1, ..., A_n, n ≥ 0, are Bayesian
and logical atoms and all variables are (implicitly) universally quantified. To
quantify probabilistic dependencies, each Bayesian clause c is annotated with its
conditional probability distribution cpd(c) = P(A | A_1, ..., A_n), which quantifies
as a macro the probabilistic dependency among ground instances of the clause.

Let us illustrate Bayesian logic programs on Jensen's stud farm example [19],
which describes the processes underlying a life-threatening hereditary disease.

Example 5. Consider the following Bayesian clauses:

carrier(X) | founder(X). (3)
carrier(X) | mother(M, X), carrier(M), father(F, X), carrier(F). (4)
suffers(X) | carrier(X). (5)

They specify the probabilistic dependencies governing the inheritance process. 
For instance, clause (4) says that the probability for a horse being a carrier of 
the disease depends on its parents being carriers. 

In this example, mother, father, and founder are logical predicates, whereas the
other ones, such as carrier and suffers, are Bayesian. The logical predicates
are then defined by a classical definite clause program which constitutes the
background theory for this example. It is listed as interpretation B in Example 2.
The conditional probability distributions for the Bayesian clauses are



    P(carrier(X) = true)
    0.6

    carrier(X)    P(suffers(X) = true)
    true          0.7
    false         0.01

    carrier(M)    carrier(F)    P(carrier(X) = true)
    true          true          0.6
    true          false         0.5
    false         true          0.5
    false         false         0.0



Observe that logical atoms, such as mother(M, X), do not affect the distribution 
of Bayesian atoms, such as carrier(X), and are therefore not considered in the 
conditional probability distribution. They only provide variable bindings, e.g., 
between carrier(X) and carrier(M).

By now, we are able to define the covers relation for Bayesian logic programs.
A Bayesian logic program together with the background theory induces
a Bayesian network. The random variables A of the Bayesian network are the
Bayesian ground atoms in the least Herbrand model I of the annotated logic
program. A Bayesian ground atom, say carrier(alan), influences another Bayesian
ground atom, say carrier(betsy), if and only if there exists a Bayesian clause
c such that

1. carrier(alan) ∈ body(c)θ ⊆ I, and

2. carrier(betsy) ≡ head(c)θ ∈ I.

Each node A has cpd(cθ) as associated conditional probability distribution. If
there are multiple ground instances in I with the same head, a combining rule
combine{·} is used to quantify the combined effect. A combining rule is a
function that maps finite sets of conditional probability distributions onto one
(combined) conditional probability distribution. Examples of combining rules are
noisy-or and noisy-and, see e.g. [19].

Example 6. The Stud farm Bayesian logic program induces the following
Bayesian network.

[Figure: the induced Bayesian network over ground atoms; only the node label
s(carl) is recoverable.]

Note that we assume that the induced network is acyclic and has a finite
branching factor. The probability distribution induced is now

$$
P(I \mid H) = \prod_{\text{Bayesian atom } A \in I}
\mathrm{combine}\{\mathrm{cpd}(c\theta) \mid \mathrm{body}(c)\theta \subseteq I,\ \mathrm{head}(c)\theta \equiv A\}. \tag{6}
$$

From Equation (6), we can see that an example e consists of a logical part
which is a Herbrand interpretation of the annotated logic program, and a
probabilistic part which is a partial state assignment of the random variables
occurring in the logical part.

Example 7. A possible example e in the Stud farm domain is

{carrier(henry) = false, suffers(henry) = false, carrier(ann) = true,
suffers(ann) = false, carrier(brian) = false, suffers(brian) = false,
carrier(alan) = ?, suffers(alan) = false, carrier(alice) = false,
suffers(alice) = false, ...}



where ? denotes an unobserved state. The covers relation for e can now be 
computed using any Bayesian network inference engine based on Equation (6). 
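For a fully observed interpretation in which every ground atom is the head of exactly one ground clause instance, Equation (6) reduces to a plain product of conditional probability table entries, so no combining rule and no inference engine is needed. The Python sketch below illustrates this special case only. The three-horse pedigree is hypothetical (it is not the background theory of Example 2), all names are illustrative, and the prior 0.6 for founders is an assumption about how the table in Example 5 reads.

```python
# CPD entries as read from Example 5 (the founder prior 0.6 is an assumed reading).
P_CARRIER_FOUNDER = 0.6
P_CARRIER_GIVEN_PARENTS = {
    (True, True): 0.6,
    (True, False): 0.5,
    (False, True): 0.5,
    (False, False): 0.0,
}
P_SUFFERS_GIVEN_CARRIER = {True: 0.7, False: 0.01}


def bernoulli(p_true, value):
    """Probability that a Boolean variable with P(true) = p_true takes `value`."""
    return p_true if value else 1.0 - p_true


def interpretation_probability(founders, parents, carrier, suffers):
    """Product of CPD entries for a fully observed interpretation."""
    prob = 1.0
    for horse, is_carrier in carrier.items():
        if horse in founders:
            p_true = P_CARRIER_FOUNDER
        else:
            mother, father = parents[horse]
            p_true = P_CARRIER_GIVEN_PARENTS[(carrier[mother], carrier[father])]
        prob *= bernoulli(p_true, is_carrier)
        prob *= bernoulli(P_SUFFERS_GIVEN_CARRIER[is_carrier], suffers[horse])
    return prob


# Hypothetical pedigree: brian is the offspring of the founders ann and henry.
founders = {"ann", "henry"}
parents = {"brian": ("ann", "henry")}
carrier = {"ann": True, "henry": False, "brian": False}
suffers = {"ann": False, "henry": False, "brian": False}

p = interpretation_probability(founders, parents, carrier, suffers)
print(p)
```

With unobserved states, such as carrier(alan) = ? in Example 7, the product would have to be summed over the completions, which is what a Bayesian network inference engine does.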

3.3 Probabilistic Proofs 

The framework of stochastic logic programs [12,31,4] was inspired by research
on stochastic context free grammars [1,26]. The analogy between context free
grammars and logic programs is that

1. grammar rules correspond to definite clauses, 

2. sentences (or strings) to atoms, and 

3. derivations to proofs. 

Furthermore, in stochastic context-free grammars, the rules are annotated with 
probability labels in such a way that the sum of the probabilities associated to 
the rules defining a non-terminal is 1.0.

Eisele and Muggleton have exploited this analogy to define stochastic logic 
programs. These are essentially definite clause programs, where each clause c 
has an associated probability label Pc such that the sum of the probabilities as- 
sociated to the rules defining any predicate is 1.0 (though less restricted versions 
have been considered as well [4]). 

This framework now allows one to assign probabilities to proofs for a given
predicate q given a stochastic logic program H ∪ B in the following manner. Let D_q
denote the set of all possible ground proofs for atoms over the predicate q. For
simplicity, it will be assumed that there is a finite number of such proofs
and that all proofs are finite (but again see [4] for the more general case). Now
associate to each proof t_q ∈ D_q the value

$$v_t = \prod_{c} P_c^{\,n_{c,t}}$$

where the product ranges over all clauses c and n_{c,t} denotes the number of
times clause c has been used in proof t_q. For stochastic context free grammars,
the values v_t correspond to the probabilities of the derivations. However, the
difference between context free grammars and logic programs is that in grammars
two rules of the form n → q, n_1, ..., n_m and q → q_1, ..., q_k always 'resolve' to
give n → q_1, ..., q_k, n_1, ..., n_m whereas resolution may fail due to unification.
Therefore, the probability of a proof tree t in D_q is

$$P(t \mid H, B) = \frac{v_t}{\sum_{s \in D_q} v_s}.$$

The probability of a ground atom is then defined as the sum of all the
probabilities of all the proofs for that ground atom.
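The two definitions above (the value v_t of a proof and its normalized probability) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the clause labels `c1`, `c2`, `c3`, their probabilities, and the two proofs are hypothetical, and each proof is represented only by the number of times it uses each clause.

```python
from fractions import Fraction
from math import prod


def proof_value(clause_probs, usage):
    """v_t: product over clauses c of P_c raised to the usage count n_{c,t}."""
    return prod(clause_probs[c] ** n for c, n in usage.items())


def proof_probabilities(clause_probs, proofs):
    """Normalize the values v_t over all ground proofs of the predicate."""
    values = {name: proof_value(clause_probs, usage) for name, usage in proofs.items()}
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical stochastic logic program and its (finitely many) ground proofs.
clause_probs = {"c1": Fraction(1, 2), "c2": Fraction(1, 2), "c3": Fraction(1, 3)}
proofs = {
    "t1": {"c1": 1, "c3": 2},  # uses c1 once and c3 twice
    "t2": {"c2": 1, "c3": 1},
}

probabilities = proof_probabilities(clause_probs, proofs)
print(probabilities["t1"], probabilities["t2"])  # 1/4 3/4
```

Exact fractions are used so that the normalized values can be checked by hand; the probability of a ground atom would then be the sum of the entries for its proofs.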

Example 8. Consider a stochastic variant of the definite clause grammar in
Example 3 with uniform probability values for each predicate. The value
of the proof (tree) u in Example 3 is v_u = […]. The only
other ground proofs s_1, s_2 of atoms over the predicate sentence are those of
sentence([a, turtle, sleeps], []) and sentence([the, turtle, sleeps], []). Both
get value v_{s_1} = v_{s_2} = […]. Because there is only one proof for each of the
sentences, P(sentence([the, turtles, sleep], [])) = v_u / (v_u + v_{s_1} + v_{s_2}) = […].

At this point, there are at least two different settings for probabilistic induc- 
tive logic programming using stochastic logic programs. The first actually corre- 
sponds to a learning from entailment setting in which the examples are ground 
atoms entailed by the target stochastic logic program. This setting has been 
studied by Cussens [5], who solves the parameter estimation problem, and
Muggleton [32,33], who presents a preliminary approach to structure learning
(concentrating on the problem of adding one clause to an existing stochastic logic
program).

In the second setting, which we sketch below, the idea is to employ learn- 
ing from proofs instead of entailment. This significantly simplifies the structure 
learning process because proofs carry a lot more information about the struc- 
ture of the underlying stochastic logic program than clauses that are entailed 
or not. Therefore, it should be much easier to learn from proofs. Furthermore, 
learning stochastic logic programs from proofs can be considered an extension of 
the work on learning stochastic grammars. It should therefore also be applicable 
to learning unification based grammars. 

4 Probabilistic Inductive Logic Programming 

By now, the problem of probabilistic inductive logic programming can be defined 
as follows: 
Given
  - a set of examples E,
  - a probabilistic covers relation P(e | H, B),
  - a probabilistic logic programming representation language, and
  - possibly a background theory B.

Find the hypothesis H* = argmax_H P(E | H, B).

Because H = (L, λ) is essentially a logic program L annotated with probabilistic
parameters λ, one distinguishes in general two subtasks:

1. Parameter estimation, where it is assumed that the underlying logic program
L is fixed, and the learning task consists of estimating the parameters λ that
maximize the likelihood.

2. Structure learning, where both L and λ have to be learned from the data.

Below, we briefly sketch the basic parameter estimation and structure learn- 
ing techniques for probabilistic inductive logic programming. A complete survey 
of learning probabilistic logic representations can be found in [9].

4.1 Parameter Estimation 

The problem of parameter estimation is thus concerned with estimating the
values of the parameters λ* of a fixed probabilistic logic program L that best
explain the examples E. So, λ is a set of parameters and can be represented as a
vector. As already indicated above, to measure the extent to which a model fits
the data, one usually employs the likelihood of the data, i.e. P(E | L, λ), though
other scores or variants could in principle be used as well.

When all examples are fully observable, maximum likelihood reduces to
frequency counting. In the presence of missing data, however, the maximum
likelihood estimate typically cannot be written in closed form. It is a numerical
optimization problem, and all known algorithms involve nonlinear optimization.
The most commonly adapted technique for probabilistic logic learning is the
Expectation-Maximization (EM) algorithm [10,28]. EM is based on the
observation that learning would be easy (i.e., correspond to frequency counting) if the
values of all the random variables were known. Therefore, it first estimates
these values, then uses these to maximize the likelihood, and then iterates.
More specifically, EM assumes that the parameters have been initialized (e.g., at
random) and then iteratively performs the following two steps until convergence:

E-Step: On the basis of the observed data and the present parameters of the 
model, compute a distribution over all possible completions of each partially 
observed data case. 

M-Step: Using each completion as a fully-observed data case weighted by its 
probability, compute the updated parameter values using (weighted) fre- 
quency counting. 



The frequencies over the completions are called the expected counts. 
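The E-step and M-step can be made concrete on a deliberately small model. The Python sketch below assumes a two-variable chain carrier → suffers in which carrier is sometimes unobserved (marked `None`); the data set, the initial parameters, and all function names are hypothetical and chosen only to show expected counts and weighted frequency counting at work.

```python
import math


def em_step(cases, p_c, p_s_c, p_s_nc):
    """One EM iteration for the chain carrier -> suffers.

    cases: list of (carrier, suffers); carrier is True, False, or None (unobserved).
    p_c = P(carrier), p_s_c = P(suffers | carrier), p_s_nc = P(suffers | not carrier).
    """
    n_c = n_s_c = n_s_nc = 0.0  # expected counts
    for carrier, suffers in cases:
        if carrier is None:
            # E-step: weight of the completion carrier = True given suffers
            like_c = p_c * (p_s_c if suffers else 1 - p_s_c)
            like_nc = (1 - p_c) * (p_s_nc if suffers else 1 - p_s_nc)
            w = like_c / (like_c + like_nc)
        else:
            w = 1.0 if carrier else 0.0
        n_c += w
        n_s_c += w * suffers
        n_s_nc += (1 - w) * suffers
    n = len(cases)
    # M-step: weighted frequency counting on the expected counts
    return n_c / n, n_s_c / n_c, n_s_nc / (n - n_c)


def log_likelihood(cases, p_c, p_s_c, p_s_nc):
    """Log-likelihood of the partially observed cases."""
    total = 0.0
    for carrier, suffers in cases:
        like_c = p_c * (p_s_c if suffers else 1 - p_s_c)
        like_nc = (1 - p_c) * (p_s_nc if suffers else 1 - p_s_nc)
        if carrier is None:
            total += math.log(like_c + like_nc)
        else:
            total += math.log(like_c if carrier else like_nc)
    return total


# Hypothetical data: some cases have an unobserved carrier state.
cases = [(True, True), (True, False), (False, False), (False, True),
         (None, True), (None, True), (None, False), (True, True), (False, False)]
params = (0.5, 0.6, 0.4)  # arbitrary initialization
history = [log_likelihood(cases, *params)]
for _ in range(20):
    params = em_step(cases, *params)
    history.append(log_likelihood(cases, *params))
print(params, history[-1])
```

The list `history` records the log-likelihood after each iteration; for EM it should never decrease, which gives a simple sanity check on an implementation.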
4.2 Structure Learning

The problem is now to learn both the structure L and the parameters λ of the
probabilistic logic program from data. Often, further information is given as well.
It can take various forms, including:

1. a language bias that imposes restrictions on the syntax of the definite clauses
allowed in L,

2. a background theory,

3. an initial hypothesis (L, λ) from which the learning process can start, and

4. a scoring function score(L, λ, E) that may correct the maximum likelihood
principle for complex hypotheses; this can be based on a Bayesian approach
which takes into account priors or the minimum description length principle.

Nearly all (score-based) approaches to structure learning perform a heuristic
search through the space of possible hypotheses. Typically, hill-climbing or
beam-search is applied until the hypothesis satisfies the logical constraints and the
score(L, λ, E) is no longer improving. The logical constraints typically require that
the examples are covered in the logical sense. E.g., when learning stochastic
logic programs from entailment, the example clauses must be entailed by the
logic program, and when learning Bayesian logic programs, the interpretation
must be a model of the logic program.

At this point, it is interesting to observe that in the learning from entailment
setting the examples do not have to be covered in the logical sense when using the
setting combining FOIL and naive Bayes, whereas when using stochastic logic
programs all examples must be logically entailed.

The steps in the search space are typically made through the application of
so-called refinement operators [43,36], which perform small modifications
to a hypothesis. From a logical perspective, these refinement operators
typically realize elementary generalization and specialization steps (usually under
θ-subsumption).

Before concluding, we will now sketch for each probabilistic inductive logic 
learning setting a structure learning algorithm. 



4.3 Learning from Probabilistic Entailment 

One promising approach to address learning from probabilistic entailment is
to adapt FOIL [41] with the conditional likelihood as described in
Equation (2) as the scoring function score(L, λ, E) (see N. Landwehr, K. Kersting
and L. De Raedt, forthcoming, for more details).

Given a training set E containing positive and negative examples (i.e. true
and false ground facts), this algorithm computes Horn clause features b_1, b_2, ...
in an outer loop. It terminates when no further improvements in the score are
obtained, i.e., when score({b_1, ..., b_i}, λ_i, E) < score({b_1, ..., b_{i+1}}, λ_{i+1}, E)
no longer holds, where λ denotes the maximum likelihood parameters. A major
difference with FOIL is, however, that the covered positive examples are not removed.
    c(X) | f(X).
    c(X) | m(M,X), c(M).
    s(X) | c(X).

    Candidate 1:         Candidate 2 (crossed out):      Candidate 3:
    c(X) | f(X).         c(X) | f(X).                    c(X) | f(X).
    c(X) | m(M,X).       c(X) | m(M,X), c(M), s(X).      c(X) | m(M,X), c(M), f(F,X).
    s(X) | c(X).         s(X) | c(X).                    s(X) | c(X).

Fig. 1. The use of refinement operators during structural search within the framework
of Bayesian logic programs. We can add an atom or delete an atom from the body
of a clause. Candidates crossed out are illegal because they are cyclic. Other refinement
operators are reasonable such as adding or deleting logically valid clauses.



The inner loop is concerned with inducing the next feature top-down, i.e.,
from general to specific. To this aim it starts with a clause with an empty body,
e.g., muta(M) ←. This clause is then specialised by repeatedly adding atoms to the
body, e.g., muta(M) ← bond(M, A, 1), muta(M) ← bond(M, A, 1), atom(M, A, c, 22, _),
etc. For each refinement b'_{i+1} we then compute the maximum-likelihood
parameters λ'_{i+1} and score({b_1, ..., b'_{i+1}}, λ'_{i+1}, E). The refinement that scores best,
say b''_{i+1}, is then considered for further refinement and the refinement process
terminates when score({b_1, ..., b_{i+1}}, λ_{i+1}, E) < score({b_1, ..., b''_{i+1}}, λ''_{i+1}, E)
no longer holds.

Preliminary results with a prototype implementation are promising. 

4.4 Learning from Probabilistic Interpretations 

Scooby [22,20,23] is a greedy hill-climbing approach for learning Bayesian logic
programs. Scooby takes the initial Bayesian logic program H = (L, λ) as
starting point and computes the parameters maximizing score(L, λ, E). Then,
refinement operators generalizing or specializing H are used to
compute all legal neighbours of H in the hypothesis space, see Figure 1. Each
neighbour is scored. Let H' = (L', λ') be the legal neighbour scoring best. If
score(L, λ, E) < score(L', λ', E) then Scooby takes H' as new hypothesis. The
process is continued until no improvements in score are obtained.

Scooby is akin to theory revision approaches in inductive logic
programming. In case that only propositional clauses are considered, Scooby coincides
with greedy hill-climbing approaches for learning Bayesian networks [16].
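The greedy search loop described above can be sketched generically. The Python code below is not Scooby itself: the hypothesis space (bodies of a single clause built from four atoms), the refinement operator (add or delete one atom), and the toy score (closeness to a fixed target body) are all hypothetical stand-ins for refinement operators and a likelihood-based score.

```python
def hill_climb(initial, neighbours, score):
    """Greedy hill-climbing: move to the best neighbour while the score improves."""
    current, current_score = initial, score(initial)
    while True:
        candidates = list(neighbours(current))
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            # No neighbour improves the score: stop at a local optimum.
            return current, current_score
        current, current_score = best, best_score


# Hypothetical hypothesis space: sets of body atoms for one clause.
ATOMS = ("m(M,X)", "c(M)", "f(F,X)", "c(F)")
TARGET = frozenset(ATOMS)  # stands in for the (unknown) best-scoring body


def neighbours(body):
    """Refinements of a body: add an absent atom or delete a present one."""
    for atom in ATOMS:
        yield body ^ {atom}


def toy_score(body):
    """Toy score: negative number of atoms in which body and TARGET differ."""
    return -len(body ^ TARGET)


result, result_score = hill_climb(frozenset({"m(M,X)"}), neighbours, toy_score)
print(sorted(result), result_score)
```

In a real system the neighbour generator would also discard illegal candidates (for example cyclic programs, as in Figure 1) and the score would require re-estimating the parameters of each candidate.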



4.5 Learning from Probabilistic Proofs 

Given a training set E containing ground proofs as examples, one possible 
approach combines ideas from the early inductive logic programming system 
Golem [34] that employs Plotkin’s [39] least general generalization (LGG) with 
bottom-up generalization of grammars and hidden Markov models [46]. The
resulting algorithm employs the likelihood of the proofs score(L, λ, E) as the
scoring function. It starts by taking as L_0 the set of ground clauses that have
been used in the proofs in the training set and scores it to obtain λ_0. After
initialization, the algorithm will then repeatedly select a pair of clauses in L_i, and
replace the pair by their LGG to yield a candidate L'. The candidate that scores
best is then taken as H_{i+1} = (L_{i+1}, λ_{i+1}), and the process iterates until the
score no longer improves. One interesting issue is that strong logical constraints
can be imposed on the LGG. These logical constraints directly follow from the
fact that the example proofs should still be valid proofs for the logical
component L of all hypotheses considered. Therefore, it makes sense to apply the LGG
only to clauses that define the same predicate, that contain the same predicates,
and whose (reduced) LGG also has the same length as the original clauses. More
details on this procedure will be worked out in a forthcoming paper (by S. Torge,
K. Kersting and L. De Raedt).
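As an illustration of the generalization step, the following Python sketch computes the least general generalization of two atoms by anti-unification. It covers atoms and terms only, not full clauses, and rests on assumptions that are not in the text: compound terms are encoded as tuples `(functor, arg1, ..., argn)`, constants as strings, the inputs are ground, and fresh variables are named `V0`, `V1`, and so on.

```python
def lgg(s, t, table=None):
    """Least general generalization (anti-unification) of two terms or atoms.

    `table` maps each pair of differing subterms to the variable that replaces it,
    so the same pair is always generalized to the same variable.
    """
    if table is None:
        table = {}
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        # Same functor and arity: generalize argument by argument.
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"
    return table[(s, t)]


# Hypothetical ground atoms in the style of the mutagenicity examples.
print(lgg(("bond", "m1", "a1", "a2"), ("bond", "m2", "a1", "a3")))
# ('bond', 'V0', 'a1', 'V1')
print(lgg(("p", "a", "a"), ("p", "b", "b")))
# ('p', 'V0', 'V0')
```

For the LGG of two clauses, the same table would have to be shared across all pairs of compatible literals, and the result would usually be reduced afterwards; both steps are omitted here.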



5 Conclusions 

In this paper, we have presented three settings for probabilistic inductive logic 
programming: learning from entailment, from interpretations and from proofs. 
We have also sketched how inductive logic programming and probabilistic learn- 
ing techniques can - in principle - be combined to address these settings. Nev- 
ertheless, more work is needed before these techniques will be as applicable as 
traditional probabilistic learning or inductive logic programming systems. The 
authors hope that this paper will inspire and motivate researchers in probabilis- 
tic learning and inductive logic programming to join the exciting new field lying 
at the intersection of probabilistic reasoning, logic programming and machine 
learning. 
Appendix: Logic Programming Concepts

A first order alphabet is a set of predicate symbols, constant symbols and functor
symbols. A definite clause is a formula of the form A ← B_1, ..., B_n where A and
B_i are logical atoms. An atom p(t_1, ..., t_n) is a predicate symbol p/n followed by
a bracketed n-tuple of terms t_i. A term t is a variable V or a function symbol
f(t_1, ..., t_k) immediately followed by a bracketed k-tuple of terms t_i. Constants
are function symbols of arity 0. Functor-free clauses are clauses that contain
only variables as terms. The above clause can be read as A if B_1 and ... and B_n.
All variables in clauses are universally quantified, although this is not explicitly
written. We call A the head of the clause and B_1, ..., B_n the body of the clause.
A fact is a definite clause with an empty body (m = 1, n = 0). Throughout
the paper, we assume that all clauses are range restricted, which means that all
variables occurring in the head of a clause also occur in its body. A substitution
θ = {V_1 ← t_1, ..., V_k ← t_k} is an assignment of terms to variables. Applying a
substitution θ to a clause, atom or term e yields the expression eθ where all
occurrences of variables V_i have been replaced by the corresponding terms. A
Herbrand interpretation is a set of ground facts over an alphabet A. A Herbrand
interpretation I is a model for a clause c if and only if for all θ,
body(c)θ ⊆ I implies head(c)θ ∈ I; it is a model for a set of clauses H if and only
if it is a model for all clauses in H. We write H ⊨ e if and only if all models of
H are also models of e.
Abstract. A hidden Markov model is introduced for descriptive modelling
of the mosaic-like structures of haplotypes, due to iterated
recombinations within a population. Methods using the minimum description
length principle are given for fitting such models to training data. Pos- 
sible applications of the models are delineated, and some preliminary 
analysis results on real sets of haplotypes are reported, demonstrating 
the potential of our methods. 



1 Introduction 

Hidden Markov models (HMMs) have become a standard tool in biological se- 
quence analysis [8,2]. Typically they have been applied to modelling multiple se- 
quence alignments of protein families and protein domains as well as to database 
searching for, say, predicting genes. 

In this paper we introduce HMM techniques for modelling the structure of 
genetic variation between individuals of the same species. Such a variation is 
seen in so-called haplotypes that are sequences of allelic values of some DNA 
markers taken from the same DNA molecule. The single nucleotide polymor- 
phisms (SNPs) are important such markers, each having two alternative values 
that may occur in haplotypes of different individuals. The SNPs cover for ex- 
ample the human genome fairly densely. The variation of haplotypes is due to 
point mutations and recombinations that take place during generations of evolu- 
tion of a population. Studying the genetic variations and correlating them with 
variations in phenotype is the commonly followed strategy for locating disease 
causing genes and developing diagnostic tests to screen people having high risk 
for these diseases. 

Several recent studies have uncovered some type of block structure in human 
haplotype data [1,10,4,18,11,6]. However, the recombination mechanism as such 
does not necessarily imply a global block structure. Rather, at least in old popu- 
lations one expects to see a mosaic-like structure that reflects the recombination 
history of the population. Assume that the haplotypes of an observed
population have developed in generations of recombinations from certain 'founder'
haplotypes. Then the current haplotypes should consist of conserved sequence 
fragments (possibly corrupted by some rare later mutations) that are taken from 
the founder sequences. Our goal is to uncover such conserved fragments from a 
sample set of haplotypes. 

Unlike most earlier approaches for analyzing haplotype structures, the model 
class introduced here has no bias towards a global block structure. Our model 
is an acyclic HMM, capable of emitting equal length sequences. Each state of 
the model can emit a sequence fragment that is to be put into some specific 
location of the entire sequence. As the location is independent of the locations 
of other states, the model has no global block structure. The hidden part of
each state represents a conserved fragment, and the transitions between the 
states model the cross-overs of recombinations. We distinguish a general variant 
whose transition probabilities depend on both states involved, and a ’simple’ 
variant whose transition probabilities depend only on the state to be entered. 

For selecting such a HMM for a given training data we suggest a method 
based on the minimum description length principle (MDL) by Rissanen [12,13] 
which is widely used in statistics, machine learning, and data mining [9,5]. An 
approximation algorithm will be given for the resulting optimization problem of 
finding a model with shortest description in the proposed encoding scheme. The 
algorithm consists of two parts that are iterated alternatingly. The greedy part 
reduces the set of the conserved fragments, and the optimizer part uses the usual 
expectation maximization algorithm for finding the transition probabilities for 
the current set of fragments. 

Our idea of block-free modeling has its roots in [17] which gave a combi- 
natorial algorithm for reconstructing founder sequences without assuming block 
structured fragmentation of the current haplotypes. In probabilistic modelling 
and using MDL for model selection we follow some ideas of [6]. Unfortunately 
in the present block-free case the optimization problem seems much harder. Re- 
cently, Schwartz [14] proposed a model basically similar to our simple model 
variant. However, his method for model selection is different from ours. A non- 
probabilistic variant of our model with an associated learning problem was in- 
troduced in [7]. 

The rest of the paper is organized as follows. In Section 2 we describe the 
HMMs for modelling haplotype fragmentations. Section 3 gives MDL methods 
for selecting such models. Section 4 discusses briefly the possible applications of 
our models. As an example we analyze two real sets of haplotypes, one from lac- 
tose tolerant and the other from lactose intolerant humans. Section 5 concludes 
the paper. 

We assume basic familiarity with HMM techniques as described e.g. in [2]. 
2 Fragmentation Models

2.1 Haplotype Fragmentation 

We want to model by a designated HMM the haplotypes in a population of a 
species of interest. The haplotypes are over a fixed interval of consecutive genetic 
markers, say m markers numbered 1,2, ... ,m from left to right. The markers 
may be of any type such as biallelic SNPs (single nucleotide polymorphisms) 
or multiallelic microsatellite markers. The alleles are the possible alternative 
’values’ a marker can get in the DNA of different individuals of the same species. 
A biallelic SNP has two possible (or most frequently occurring) values that 
actually refer to two alternative nucleotides that may occur in the location of 
the SNP in the DNA sequence. Hence each marker i has a corresponding set A_i
of possible alleles, and each haplotype over the m markers is simply a sequence
of length m in A_1 × ... × A_m.

Our HMM will be based on the following scenario of the structure of the
haplotypes as a result of the microevolutionary process that causes genetic vari- 
ation within a species. We want to model the haplotypes of individuals in a 
population of some species such as humans. Let us think that the observed pop- 
ulation was founded some generations ago by a group of ’founders’. According to 
the standard model of DNA microevolution, the DNA sequences of the current 
individuals are a result of iterated recombinations of the DNA of the founders, 
possibly corrupted by point mutations that, however, are considered rare. Simply 
stated, a recombination step produces from two DNA sequences a new sequence 
that consists of fragments taken alternatingly from the two parent sequences. 
The fragments are taken from the same locations of the parents as is their tar- 
get location in the offspring sequence. Hence the nucleotides of the parent DNA 
shuffle in a novel way but retain their locations in the sequence. 

The haplotypes reflect the same structure as they can be seen as subsequences 
obtained from the full DNA by restriction to the markers. So, if haplotype R is
a recombination of haplotypes G and H, then all three can be written for some
c > 0 as

$$
\begin{aligned}
R &= G_1 H_1 G_2 H_2 \cdots G_c H_c \\
G &= G_1 G'_1 G_2 G'_2 \cdots G_c G'_c \\
H &= H'_1 H_1 H'_2 H_2 \cdots H'_c H_c
\end{aligned}
$$

where |G_i| = |H'_i| > 0 for 1 ≤ i ≤ c, |G'_i| = |H_i| > 0 for 1 ≤ i < c, and
|G'_c| = |H_c| ≥ 0. Haplotype R has a cross-over between markers i and i + 1 if
the markers do not belong to the same fragment G_j or H_j for some j.

Assume that such recombination steps are applied repeatedly on an evolving 
set of sequences, starting from an initial set of founder haplotypes. Then a hap- 
lotype of a current individual is built from conserved sequence fragments that 
are taken from the haplotypes of the founders. In other words, each haplotype 
has a parse 



$$f_1 f_2 \cdots f_h$$

where each f_i is a contiguous fragment taken from the same location of some
founder haplotype, possibly with some rare changes due to point mutations. This
parse is unknown. Our goal is to develop HMM modelling techniques that could
help uncover such parses as well as conserved haplotype segments.

To that end, we will introduce a family of HMMs whose states model the
conserved sequence fragments and whose transitions between the states model the
cross-overs. The conserved fragments in the models will be taken from the
sequences in $A_1 \times \cdots \times A_m$. The parameters of the HMM will be estimated from
training data that consists of some observed haplotypes in the current
population. We will apply the minimum description length principle for model
selection to uncover the fragments that can be utilized in parsing several different
haplotypes of the training data. Our model does not make any prior assumption
on the distribution of the cross-over points that would prefer, say, a division of
the haplotypes into global blocks between recombination 'hot spots'.



2.2 Model Architecture 

A hidden Markov model $M = (F, \epsilon, W)$ for modeling haplotype fragmentation
consists of a set $F$ of the states of $M$, the error parameter $\epsilon \ge 0$ of $M$, and the
transition probabilities $W$ between the states of $M$.

Each state $f \in F$ is actually a haplotype fragment, by which we mean a
contiguous segment of a possible haplotype between some start marker and some
end marker, that is, $f$ is an element of $A_s \times \cdots \times A_e$ where $1 \le s \le e \le m$. We
often call the states $f$ the fragments of $M$. The start and end markers of $f$ are
denoted by $s(f)$ and $e(f)$, respectively.

The emission probabilities $P(d \mid f)$ of state (fragment) $f$ give a probability
distribution for fragments $d \in A_{s(f)} \times \cdots \times A_{e(f)}$ when the underlying founder
fragment is $f$. This distribution will include a model for mutation and noise
rates, specified by the noise parameter $\epsilon$. It is also possible to incorporate a
model for missing data, which is useful if the model training data is incomplete.
We adopt perhaps the simplest and most practical alternative, the
missing-at-random model. Let us write $f = f_s \cdots f_e$ and $d = d_s \cdots d_e$. Assuming that the
markers are independent of each other we have

\[ P(d \mid f) = \prod_{i=s}^{e} p(d_i \mid f_i), \]

where we define

\[
p(d_i \mid f_i) =
\begin{cases}
1 & \text{if } d_i \text{ is missing} \\
1 - \epsilon & \text{if } d_i = f_i \\
\epsilon / (|A_i| - 1) & \text{otherwise.}
\end{cases}
\]

We assume that a value for $\epsilon$ is a given constant and do not consider here its
estimation from the data. Therefore we omit $\epsilon$ from the notation and denote a
fragmentation model just as $M = (F, W)$.

Fig. 1. A simple fragmentation model for haplotypes with 6 binary markers

Model $M$ has a transition from state $f$ to state $g$ whenever $f$ and $g$ are
adjacent, that is, if $e(f) + 1 = s(g)$. If this is the case, the transition probability
$W(f, g)$ is defined, otherwise not. The probabilities of the transitions from $f$
must satisfy $\sum_{g} W(f, g) = 1$.

A transition function $W$ defined this way has the drawback of a fairly large
number of probabilities to estimate. Therefore we consider the following
simplified version. A fragmentation model is called simple if all probabilities $W(f, g)$
for a fixed $g$ but varying $f$ are equal. Hence $g$ is entered with the same
probability, independently of the previous state $f$. Then we can write $W(f, g) = W(g)$,
for short. A simple fragmentation model is specified by giving the fragments $F$,
the transition probability $W(g)$ for each $g \in F$, and the error parameter $\epsilon$. From
now on, in the rest of the paper, we only consider the simple fragmentation
models.

An example of a simple fragmentation model over 6 (binary) markers is shown
in Figure 1. The fragments for the states are shown inside each rectangle, the
height of which encodes the corresponding transition probability. For example,
each of the three states whose start marker is 1 has transition probability 1/3,
and the three states with start marker 3 have transition probabilities 2/3, 1/6,
and 1/6. When $\epsilon = 0$ this model generates haplotype 000000 with probability
$1/3 \cdot 2/3 \cdot 1/2 = 1/9$ (the only path emitting this sequence with non-zero
probability goes through the three states with fragment 00), and haplotype 111111
with probability 0.
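
As a small illustration of the emission model of a single state, the following sketch evaluates $P(d \mid f)$ under the missing-at-random rule. The names `emission_prob` and `MISSING`, and the use of `None` to mark a missing allele, are assumptions of this sketch rather than anything prescribed by the paper.

```python
MISSING = None  # hypothetical marker for a missing allele value


def emission_prob(d, f, eps, alphabet_sizes):
    """P(d | f) for an observed fragment d and a founder fragment f.

    d, f           : equal-length sequences of alleles over the same markers
    eps            : error parameter (mutation and noise rate)
    alphabet_sizes : |A_i| for each marker covered by the fragment
    """
    p = 1.0
    for d_i, f_i, size in zip(d, f, alphabet_sizes):
        if d_i is MISSING:
            continue                      # a missing value contributes factor 1
        if d_i == f_i:
            p *= 1.0 - eps                # allele copied without error
        else:
            p *= eps / (size - 1)         # error spread over the other alleles
    return p


print(emission_prob("010", "000", 0.01, [2, 2, 2]))
print(emission_prob([MISSING, "0"], "00", 0.1, [2, 2]))
```

Because the markers are treated as independent, the result is a plain product of per-marker terms, so precomputing it for every pair of fragment and training haplotype is straightforward.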



2.3 Emission Probability Distribution 

Let us recall how a hidden Markov model associates probabilities to the sequences
it emits. A path through a simple fragmentation model $M = (F, W)$ over $m$
markers is any sequence $(F_1, \ldots, F_h)$ of states of $M$ such that $s(F_1) = 1$,
$e(F_i) + 1 = s(F_{i+1})$ for $1 \le i < h$, and $e(F_h) = m$. Let $\pi$ be the set of all paths through
$M$.

Consider then the probability of emitting a haplotype $H = H_1 \cdots H_m$. Some
allele values may be missing in $H$. The probability that $H$ is emitted from path
$(F_1, \ldots, F_h)$ is

\[ P(H, (F_1, \ldots, F_h) \mid M) = \prod_{i=1}^{h} W(F_i) \, P(H_{s(F_i)} \cdots H_{e(F_i)} \mid F_i) \]

which simply is the probability that the path is taken and each state along the
path emits the corresponding segment of $H$, the emission probabilities being as
already defined. The probability that $M$ emits $H$ is then

\[ P(H \mid M) = \sum_{(F_1, \ldots, F_h) \in \pi} P(H, (F_1, \ldots, F_h) \mid M). \qquad (1) \]
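
The sum in (1) need not be evaluated path by path; a forward recursion over marker positions suffices. The sketch below assumes binary markers, writes a missing allele as `"?"`, and uses a hypothetical model `MODEL` that merely agrees with the facts stated about Figure 1 (three start-1 states with probability 1/3 each, start-3 states with probabilities 2/3, 1/6, 1/6, and a path through three 00 fragments). The actual fragments of Figure 1 are not recoverable from the text, so this model is an illustrative stand-in.

```python
def emit(d, f, eps, k=2):
    """P(d | f) with k alleles per marker; '?' marks a missing value (assumption)."""
    p = 1.0
    for d_i, f_i in zip(d, f):
        if d_i == "?":
            continue
        p *= (1.0 - eps) if d_i == f_i else eps / (k - 1)
    return p


def forward_prob(h, fragments, eps):
    """P(H | M) for a simple fragmentation model.

    fragments: iterable of (start, fragment_string, W) with 1-based start.
    alpha[j] is the probability of emitting h[1..j] along paths ending at j.
    """
    m = len(h)
    alpha = [0.0] * (m + 1)
    alpha[0] = 1.0
    # Sorting by start guarantees alpha[start - 1] is final when it is read.
    for start, frag, w in sorted(fragments):
        end = start + len(frag) - 1
        alpha[end] += alpha[start - 1] * w * emit(h[start - 1:end], frag, eps)
    return alpha[m]


# Hypothetical model consistent with the description of Figure 1.
MODEL = [
    (1, "00", 1 / 3), (1, "01", 1 / 3), (1, "10", 1 / 3),
    (3, "00", 2 / 3), (3, "01", 1 / 6), (3, "10", 1 / 6),
    (5, "00", 1 / 2), (5, "11", 1 / 2),
]

print(forward_prob("000000", MODEL, 0.0))  # approximately 1/9
print(forward_prob("111111", MODEL, 0.0))  # 0.0
```

With the emission probabilities precomputed, each fragment is visited once, which is the constant-time-per-fragment behaviour assumed in the running time analysis of the EM algorithm.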

3 MDL Method for Model Selection 

3.1 Description Length 

Let $D$ be our training data consisting of $n$ observed haplotypes $D_1, \ldots, D_n$
over $m$ markers. The minimum description length (MDL) principle of Rissanen
considers the description of the data $D$ using two components: description of
the model $M$ and description of the data $D$ given the model. Hence the total
description length for the model and the data is

\[ L(M, D) = L(M) + L(D \mid M) \]

where $L(M)$ is the length of the description of $M$ and $L(D \mid M)$ is the length of
the description of $D$ when $D$ is described using $M$.

The MDL principle states that the desired descriptions of the data are the
ones having the minimum length $L(M, D)$ of the total description. For a survey
of the connections between MDL, Bayesian statistics, and machine learning see
[9,5].

To apply this principle we have to fix the encoding scheme that will give
$L(M)$ and $L(D \mid M)$ for each particular $M$ and $D$.

The model $M = (F, W)$ can be described by telling what the fragments
$F_i$ in $F$ are and where they start. The transition probabilities $W(F_i)$ should also be
given. The error parameter $\epsilon$ is assumed constant and hence need not be
encoded.

To encode fragment $F_i$ we use

\[ L(F_i) = \sum_{j=s(F_i)}^{e(F_i)} \log |A_j| + \log m \]

bits where $\log |A_j|$ bits are used for representing the corresponding allele of $F_i$,
and $\log m$ bits are used for $s(F_i)$. The probabilities $W(F_i)$ are real numbers.
Theoretical arguments [5] indicate that an appropriate coding precision is
obtained by choosing to use $\frac{1}{2} \log n$ bits for each independent real parameter of the
model; recall that $n$ is the size of training data $D$. In our case there are $|F| - t$
independent probability values where $t$ denotes the number of different start
markers of the fragments in $F$, because one of the probabilities $W(F_i)$ for fragments $F_i$
with the same $s(F_i)$ follows from the others as their sum has to equal 1.
Thus the total model length becomes

\[
\begin{aligned}
L(M) &= \sum_{i=1}^{|F|} L(F_i) + \frac{|F| - t}{2} \log n \\
     &= \sum_{i=1}^{|F|} \Bigl( \sum_{j=s(F_i)}^{e(F_i)} \log |A_j| + \log m \Bigr) + \frac{|F| - t}{2} \log n. \qquad (2)
\end{aligned}
\]

Using the relation of coding lengths and probabilities [5], we use

\[ L(D \mid M) = -\log \prod_{i=1}^{n} P(D_i \mid M) = -\sum_{i=1}^{n} \log P(D_i \mid M) \qquad (3) \]

bits for describing the data $D$, given the fragmentation model $M$. Here
probabilities $P(D_i \mid M)$ can be evaluated using (1).

We are left with designing an algorithm that solves the MDL minimization
problem. One has to find an $M = (F, W)$ for a fixed error parameter $\epsilon$ that
minimizes $L(M) + L(D \mid M)$ for a given data $D$.
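
The two parts of the score can be computed directly from (2) and (3). The sketch below is illustrative only: it assumes base-2 logarithms (lengths in bits), represents each fragment by its hypothetical `(start, end)` marker pair, and takes the probabilities $P(D_i \mid M)$ as already computed.

```python
import math


def model_length(fragments, alphabet_sizes, n):
    """L(M) as in (2).

    fragments      : list of (start, end) marker positions, 1-based, inclusive
    alphabet_sizes : |A_j| for j = 1..m
    n              : number of training haplotypes
    """
    m = len(alphabet_sizes)
    fragment_bits = sum(
        sum(math.log2(alphabet_sizes[j - 1]) for j in range(s, e + 1)) + math.log2(m)
        for s, e in fragments
    )
    t = len({s for s, _ in fragments})          # number of distinct start markers
    return fragment_bits + (len(fragments) - t) / 2 * math.log2(n)


def data_length(probabilities):
    """L(D | M) as in (3), given P(D_i | M) for each training haplotype."""
    return -sum(math.log2(p) for p in probabilities)


frags = [(1, 2), (1, 2), (3, 4)]
print(model_length(frags, [2, 2, 2, 2], n=16))   # 14.0
print(data_length([0.5, 0.25]))                  # 3.0
```

Note that `model_length` depends only on the positions of the fragments and on $n$, not on the fragment contents or on $W$, which is what allows the coding lengths to be precomputed.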

3.2 Greedy Algorithms for MDL Optimization 

As solving the MDL optimization exactly in this case seems difficult we give a
greedy approximation algorithm. The algorithm has to solve two tasks that are
to some degree independent. First, one has to find a good fragment set $F$ for
the model. Note that the model length $L(M)$ depends only on $F$ (and on $n$).
Second, for a fixed $F$ one has to find a good $W$ such that $L(D \mid M)$ is minimized.

Let us consider the second task first. Given $F$ and $D$, we will use the
well-known expectation maximization (EM) algorithm for finding $W$ that gives
(locally) maximum $P(D \mid M)$ from which we get minimum $L(D \mid M)$; see e.g. [2, pp
63-64]. The EM algorithm, delineated as Algorithm 1 below, starts with some
initial $W$ and then computes the expected number of times each state of the model
with the current $W$ is used when emitting $D$. These values are normalized to
give updated $W$. The process is repeated with the new $W$ until convergence
(or a given number $K$ of times). When converged, local maximum likelihood
estimates for $W$ have been obtained.

Algorithm 1. EM-algorithm for finding $W$ for a model $M = (F, W)$.

1. Initialize $W$ somehow.

2. (E-step) For each training data sequence $D_i \in D$ and fragment $f \in F$,
compute $q_i(f)$ as the probability that $M$ emits $D_{i,s(f)} \cdots D_{i,e(f)}$ from $f$.
This can be done using the standard Forward and Backward algorithms for
HMMs.

3. (M-step) For each fragment $f$, let $q(f) \leftarrow \sum_i q_i(f)$. Finally set

\[ W(f) \leftarrow \frac{q(f)}{\sum_{f' \in F(s(f))} q(f')} \]

where $F(s(f))$ denotes the subset of $F$ that consists of all fragments having
the same start marker as $f$.

4. Repeat steps 2 and 3 until convergence (or until a given number of iterations
have been taken).

We will denote by $EM(F)$ the $W$ obtained by Algorithm 1 for a set $F$ of
fragments.
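
A compact sketch of Algorithm 1 follows. It is one possible reading of the steps above, not the authors' implementation: it assumes binary markers, a uniform initial $W$ (the paper leaves the initialization open), a fixed number of iterations in place of a convergence test, and hypothetical names such as `em_weights`.

```python
def emit(d, f, eps, k=2):
    """P(d | f) with k alleles per marker; '?' marks a missing value (assumption)."""
    p = 1.0
    for d_i, f_i in zip(d, f):
        if d_i == "?":
            continue
        p *= (1.0 - eps) if d_i == f_i else eps / (k - 1)
    return p


def em_weights(data, fragments, eps=0.001, iterations=10):
    """Estimate W for fragments given as (start, string) pairs, 1-based start."""
    m = len(data[0])
    by_start = {}
    for idx, (s, _) in enumerate(fragments):
        by_start.setdefault(s, []).append(idx)
    w = [1.0 / len(by_start[s]) for s, _ in fragments]      # step 1: uniform W
    order = sorted(range(len(fragments)), key=lambda i: fragments[i][0])
    for _ in range(iterations):
        q = [0.0] * len(fragments)
        for d in data:                                       # step 2: E-step
            em = [emit(d[s - 1:s - 1 + len(f)], f, eps) for s, f in fragments]
            alpha = [0.0] * (m + 1)      # alpha[j]: emit d[1..j], path ends at j
            alpha[0] = 1.0
            for i in order:
                s, f = fragments[i]
                alpha[s + len(f) - 1] += alpha[s - 1] * w[i] * em[i]
            beta = [0.0] * (m + 2)       # beta[j]: emit d[j..m], path starts at j
            beta[m + 1] = 1.0
            for i in reversed(order):
                s, f = fragments[i]
                beta[s] += w[i] * em[i] * beta[s + len(f)]
            if alpha[m] == 0.0:
                continue                 # this haplotype cannot be emitted at all
            for i, (s, f) in enumerate(fragments):
                q[i] += alpha[s - 1] * w[i] * em[i] * beta[s + len(f)] / alpha[m]
        for idxs in by_start.values():                       # step 3: M-step
            total = sum(q[i] for i in idxs)
            if total > 0.0:
                for i in idxs:
                    w[i] = q[i] / total
    return w


data = ["0000", "0000", "0011"]
frags = [(1, "00"), (3, "00"), (3, "11")]
print(em_weights(data, frags, eps=0.0))
```

In this toy input two of the three haplotypes end in 00 and one in 11, so with $\epsilon = 0$ the weights of the two fragments starting at marker 3 settle at 2/3 and 1/3.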

To analyze the running time of Algorithm 1, assume that we have
precomputed the probabilities $P(D_{i,s(f)} \cdots D_{i,e(f)} \mid f)$ for each $f \in F$ and $D_i \in D$.
Then the Forward and Backward algorithms in step 2 just spend a constant
time per each fragment when they scan $F$ from left to right and from right to
left, respectively. This happens for each $D_i$. Hence step 2 needs time $O(n|F|)$.
This obviously dominates the time requirement of step 3, too. So we obtain the
following remark.



Proposition 1. Algorithm 1 (EM algorithm) for $M = (F, W)$ takes time
$O(Kn|F|)$ where $K$ is the number of iterations taken.



Let us then return to the first task, selecting fragment set $F$. According to
our definition of fragmentation models, set $F$ should be selected from the set of
all possible fragments of the sequences in $A_1 \times \cdots \times A_m$. As this search space is of
exponential size in $m$, we restrict the search in practice to the fragments that are
present in $D$. Let us denote as $\Phi(D) = \{D_{i,j} \cdots D_{i,k} \mid D_i \in D, 1 \le j \le k \le m\}$
the initial set of fragments obtained in this way. Then $|\Phi(D)| = O(nm^2)$.
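
The initial candidate set can be enumerated directly from the definition. In this sketch a fragment is represented as a hypothetical `(start, content)` pair with a 1-based start marker, and identical fragments found in several haplotypes are stored once.

```python
def candidate_fragments(data):
    """Phi(D): every contiguous fragment of every training haplotype."""
    fragments = set()
    for d in data:
        m = len(d)
        for j in range(1, m + 1):            # start marker
            for k in range(j, m + 1):        # end marker
                fragments.add((j, d[j - 1:k]))
    return fragments


phi = candidate_fragments(["000", "111"])
print(len(phi))  # 12
```

Each haplotype contributes at most m(m + 1)/2 fragments, which gives the O(nm^2) bound stated above.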

We select $F$ from $\Phi(D)$ using a simple greedy strategy: delete from $\Phi(D)$ the
fragment whose elimination maximally improves the score $L(M) + L(D \mid M)$. If
a fragment is deleted, then one also has to remove all other fragments that the
deletion makes isolated. A fragment is isolated if no path through the model
can contain it, which is easy to test. Repeat deletions until the score does not
improve. The remaining set of fragments is $F$. This method is given in more
detail as Algorithm 2.



Algorithm 2. Basic greedy MDL learning

Notation: $M(-f)$ = the model that is obtained from the current model $M = (F, W)$
by deleting fragment $f$ and all fragments that become isolated as a
side-effect of the removal of $f$ from $F$ and by updating $W$ by Algorithm 1 such that
$L(D \mid M(-f))$ is minimal.

1. Initialize $M = (F, W)$ as $F \leftarrow \Phi(D)$; $W \leftarrow EM(\Phi(D))$

2. Return Greedy($M$) where procedure Greedy is as follows.

procedure Greedy($M$), where $M = (F, W)$

    do

        $L \leftarrow L(M) + L(D \mid M)$

        $\Delta \leftarrow \max_{f \in F} (L - [L(M(-f)) + L(D \mid M(-f))])$
        $f_\Delta \leftarrow \arg\max_{f \in F} (L - [L(M(-f)) + L(D \mid M(-f))])$
        if $\Delta > 0$ then $M \leftarrow M(-f_\Delta)$

    until $\Delta \le 0$
    return $M$

end Greedy
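
The control flow of procedure Greedy, together with the test for isolated fragments, can be sketched as follows. The scoring function is passed in by the caller and stands in for $L(M) + L(D \mid M)$ after retraining $W$ with Algorithm 1; the names `prune_isolated` and `greedy` and the `(start, content)` fragment representation are assumptions of this sketch.

```python
def prune_isolated(frags, m):
    """Keep only fragments that lie on some path covering markers 1..m."""
    ends = {0}                                   # marker positions reachable from the left
    for s, f in sorted(frags):
        if s - 1 in ends:
            ends.add(s + len(f) - 1)
    starts = {m + 1}                             # start positions that can reach marker m
    for s, f in sorted(frags, reverse=True):
        if s + len(f) in starts:
            starts.add(s)
    return {(s, f) for s, f in frags if s - 1 in ends and s + len(f) in starts}


def greedy(fragments, m, score):
    """Delete fragments one at a time while the score strictly decreases."""
    current = prune_isolated(set(fragments), m)
    best = score(current)
    while True:
        candidates = []
        for f in current:
            reduced = prune_isolated(current - {f}, m)
            if reduced:                          # an empty model has no path at all
                candidates.append((score(reduced), reduced))
        if not candidates:
            return current
        new_score, reduced = min(candidates, key=lambda c: c[0])
        if new_score >= best:                    # Delta <= 0: no improvement
            return current
        current, best = reduced, new_score


frags = {(1, "00"), (1, "0"), (2, "0")}
print(greedy(frags, 2, score=len))  # {(1, '00')}
```

Using `len` as the score is only a toy stand-in that rewards small models; in the setting of the paper each call to `score` would involve a run of the EM algorithm, which is where the running time discussed next comes from.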

To implement Algorithm 2 efficiently one has to precompute the coding
lengths of all the fragments in $\Phi(D)$. This can be done in time $O(|\Phi(D)|)$ as the
length only depends on the location of the fragment but not on the content. Then
$L(M)$ can be evaluated for any $M = (F, W)$ in time $O(|F|) = O(|\Phi(D)|)$. Training
a new $M(-f)$ by Algorithm 1 can be done in time $O(Kn|F|) = O(Kn|\Phi(D)|)$
where $K$ is the parameter limiting the number of iterations. To find $\Delta$ and $f_\Delta$
in procedure Greedy, a straightforward implementation just tries each $f \in F$,
taking time $O(Kn|\Phi(D)|^2)$. This will be repeated until nothing can be deleted
from the set of fragments, i.e., $O(|\Phi(D)|)$ times. The total time of Algorithm 2
hence becomes $O(Kn|\Phi(D)|^3) = O(Kn^4m^6)$.

As Algorithm 2 can be practical only for very small $m$ and $n$, we next
develop a faster incremental version of the greedy MDL training. This algorithm
constructs intermediate models using the initial segments of the sequences in $D$,
in increasing order of the length. A model, denoted $M_{j+1}$, for the initial segments
of length $j + 1$ will be constructed by expanding and retraining the model $M_j$
obtained in the previous phase for the initial segments of length $j$.

Let $\Phi_j(D) = \{D_{i,k} \cdots D_{i,j} \mid D_i \in D, 1 \le k \le j\}$ be the set of fragments of
$D$ that end at marker $j$. To get $M_{j+1}$, fragments $\Phi_{j+1}(D)$ are added to $M_j$ and
then the useless fragments are eliminated as in Algorithm 2 by procedure Greedy
to get $M_{j+1}$. Adding the fragments in such smaller portions to the optimization
leads to a faster algorithm.

A detail needs additional care. Adding $\Phi_{j+1}(D)$ alone to the set of fragments
may introduce isolated fragments that have no possibility to survive in the MDL
optimization because they are never reached although they could be useful in
encoding the data. To keep the new set of fragments connected we therefore
also add fragments that bridge the gaps between the old fragments inherited
from $M_j$ and the new fragments in $\Phi_{j+1}(D)$. We use the following bridging
strategy that by adding only the shortest bridges keeps the number of bridging
fragments relatively small. We say that the set $\Gamma(F)$ of the gaps of a fragment
set $F$ consists of all pairs $(k, h)$ of integers such that $k - 1 = e(f)$ for some
$f \in F$ but there is no $f \in F$ such that $k < e(f) < h$. Then define $\Phi_j(D, F) =
\Phi_j(D) \cup \{D_{i,k} \cdots D_{i,h} \mid D_i \in D, (k, h) \in \Gamma(F), h < j\}$. Now, we add for the new
round the set $\Phi_{j+1}(D, F_j)$ where $F_j$ is the fragment set of $M_j$.

Algorithm 3. Incremental greedy MDL learning
Notation: Procedure Greedy as in Algorithm 2.

1. Initialize $M = (F, W)$ as $F \leftarrow \emptyset$; $W \leftarrow \emptyset$

2. for $j \leftarrow 1, \ldots, m$ do

    $(F, W) \leftarrow$ Greedy($F \cup \Phi_j(D, F)$, $EM(F \cup \Phi_j(D, F))$)

3. return $M = (F, W)$.

For a running time analysis of Algorithm 3 assume that the greedy MDL
optimization is able to reduce the size of the fragment set in each phase to
$O(mn)$. This is a plausible assumption as the size of $D$ is $mn$. Then the size
of each $F \cup \Phi_j(D, F)$ given to procedure Greedy in step 2 stays $O(mn)$. Hence
Greedy generates $O(m^2n^2)$ calls of Algorithm 1, each taking time $O(Kmn^2)$.
This is repeated $m$ times, giving altogether $O(m^3n^2)$ calls of Algorithm 1 and
total time $O(Km^4n^4)$ for Algorithm 3.

Adding new fragments into the optimization in still smaller portions can in
some cases give better running times (but possibly at the expense of weaker
optimization results). One can for example divide $\Phi_j(D)$ into fractions $\Phi_{k,j}(D)$
consisting of equally long fragments which start at marker $k$ and end at $j$. The
greedy MDL optimization is performed after adding the next $\Phi_{k,j}(D)$ and the
necessary bridging fragments.



4 Using the Fragmentation Models 

Once we have trained a model $M$, numerous possibilities of applying such
an HMM become available. We delineate here some of them.

1. Parsing haplotypes. Given some haplotype $H$, the path through $M$ that has
the highest emission probability of $H$ is called the Viterbi path of $H$. Such a
path can be efficiently found by standard dynamic programming, e.g. [2]. The
fragments on this path give a natural parsing of $H$ in terms of the fragments
of $M$. The parse is the most probable decomposition of $H$ into conserved pieces
as proposed by $M$. Such a parse can be visualized by associating a unique color
with each fragment of $M$, and then showing the parse of $H$ with the colors of
the corresponding fragments. Using the same color for conserved pieces seems a
natural idea, independently proposed at least in [15,17].

2. Cross-over and fragment usage probabilities. The probability that $M$ assigns
a cross-over between the markers $i$ and $i + 1$ of haplotype $H$ can be computed
simply as the ratio of the probability of emitting $H$ along paths having
a cross-over at that point to the total probability $P(H \mid M)$ of emitting $H$
along any path. Similarly, the probability that a certain fragment of $M$ is used
for emitting $H$ is the ratio of the emission probability of $H$ along paths
containing this fragment to $P(H \mid M)$.

3. Comparisons between populations and case/control studies. Assume that we
have available training data sets from different populations. Then the
corresponding trained models $M_i$ can be used for example for classifying new
haplotypes $H$ on the basis of emission probabilities $P(H \mid M_i)$. The structure of the
models $M_i$ may also uncover interesting differences between the populations.
For example, some strong fragments may be characteristic of a certain
population but missing elsewhere. Also, the average length of the fragments should be
related to the age of the population. An older population has experienced more
recombinations, hence its fragments should be shorter on average than those of a
younger population.
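
The first two uses listed above, Viterbi parsing and cross-over probabilities, reduce to short dynamic programs over marker positions. The sketch below assumes binary markers and reuses a hypothetical six-marker model `MODEL` that is only consistent with the description of Figure 1; a cross-over between markers i and i + 1 is taken to mean that a fragment of the path ends at marker i. For long haplotypes one would work with logarithms to avoid underflow, which this sketch omits.

```python
def emit(d, f, eps, k=2):
    """P(d | f) with k alleles per marker; '?' marks a missing value (assumption)."""
    p = 1.0
    for d_i, f_i in zip(d, f):
        if d_i == "?":
            continue
        p *= (1.0 - eps) if d_i == f_i else eps / (k - 1)
    return p


def viterbi_parse(h, fragments, eps):
    """Most probable parse of h as a list of (start, fragment), or None."""
    m = len(h)
    best = [0.0] * (m + 1)
    best[0] = 1.0
    back = [None] * (m + 1)
    for start, frag, w in sorted(fragments):
        end = start + len(frag) - 1
        p = best[start - 1] * w * emit(h[start - 1:end], frag, eps)
        if p > best[end]:
            best[end], back[end] = p, (start, frag)
    parse, j = [], m
    while j > 0 and back[j] is not None:         # trace back from marker m
        parse.append(back[j])
        j = back[j][0] - 1
    return parse[::-1] if j == 0 else None


def crossover_posterior(h, fragments, eps):
    """Entry i-1 is the probability of a cross-over between markers i and i+1."""
    m = len(h)
    alpha = [0.0] * (m + 1)
    alpha[0] = 1.0
    for start, frag, w in sorted(fragments):
        end = start + len(frag) - 1
        alpha[end] += alpha[start - 1] * w * emit(h[start - 1:end], frag, eps)
    beta = [0.0] * (m + 2)
    beta[m + 1] = 1.0
    for start, frag, w in sorted(fragments, reverse=True):
        end = start + len(frag) - 1
        beta[start] += w * emit(h[start - 1:end], frag, eps) * beta[end + 1]
    if alpha[m] == 0.0:
        raise ValueError("haplotype has zero probability under the model")
    return [alpha[i] * beta[i + 1] / alpha[m] for i in range(1, m)]


# Hypothetical model consistent with the description of Figure 1.
MODEL = [
    (1, "00", 1 / 3), (1, "01", 1 / 3), (1, "10", 1 / 3),
    (3, "00", 2 / 3), (3, "01", 1 / 6), (3, "10", 1 / 6),
    (5, "00", 1 / 2), (5, "11", 1 / 2),
]

print(viterbi_parse("000100", MODEL, 0.01))
print(crossover_posterior("000000", MODEL, 0.0))
```

Fragment usage probabilities follow the same pattern: for a fragment with start s and end e, multiply `alpha[s - 1]`, its weight, its emission probability and `beta[e + 1]`, then divide by `alpha[m]`.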

To demonstrate our methods we conclude by analyzing two real datasets
related to the lactase nonpersistence (lactose intolerance) of humans [16]. Lactase
nonpersistence limits the use of fresh milk among adults. The digestion of milk is
catalyzed by the enzyme lactase (also called lactase-phlorizin hydrolase or LPH).
Lactase activity is high during infancy but in most mammals declines after the
weaning phase. In some healthy humans, however, lactase activity persists at a
high level throughout adult life, and this is known as lactase persistence. People
with lactase nonpersistence have a much lower lactose digestion capacity than
those with lactase persistence. A recent study [3] of a DNA region containing
LCT, the gene encoding LPH, revealed that in certain Finnish populations a
DNA variant (that is, an SNP), C/T$_{-13910}$, almost 14 kb upstream from the
LCT locus, completely associates with verified lactase nonpersistence.

The datasets we analyse consist of haplotypes over 23 SNP markers in the
vicinity of this particular SNP. The first dataset $D_+$, the persistent haplotypes,
consists of 38 haplotypes of lactase persistent individuals, and the second dataset
$D_-$, the nonpersistent haplotypes, consists of 21 haplotypes of lactase
nonpersistent individuals. We trained models for datasets $D_-$, $D_+$, and $D_+ \cup D_-$ using
Algorithm 3 with $\epsilon = 0.001$ and $K = 3$ iterations of the EM algorithm. Figures
2, 3, and 4 show the result. The fragments are shown as colored rectangles. Their
height encodes the transition probability somewhat differently from the
encoding used in Fig. 1. Here the height of fragment $f$ gives the value $\bar{W}(f) = q(f)/n$
where $q(f)$ is as in Algorithm 1 after the last iteration taken, and $n$ is the number
of haplotypes in the training data. Hence the height represents the average
usage of $f$ when the model emits the training data. The measure $\bar{W}(f)$ describes
the importance of each fragment for the training data better than the plain
transition probability $W(f)$.

We observe that the model for persistent haplotypes has on average longer
fragments than the model for nonpersistent haplotypes. This suggests,
consistently with common belief, that lactase persistence is the younger of the two
traits. When analyzing the Viterbi paths in the model for $D_+ \cup D_-$, we observed
that all such paths for persistent haplotypes go through the large fragment in
the middle of the model (number 4 from the top), while none of the Viterbi
paths for the nonpersistent haplotypes includes it.

Fig. 2. A fragmentation model for nonpersistent haplotypes.

Fig. 3. A fragmentation model for persistent haplotypes.

Fig. 4. A fragmentation model for the union of persistent and nonpersistent haplotypes.

We also sampled two thirds of $D_+$ and $D_-$ and trained models for the
two samples. Let $M_+$ and $M_-$ be the two models obtained. We then computed
for all test haplotypes $H$ outside the training samples the quantity
$Q(H) = \log_{10} \bigl( P(H \mid M_+) / P(H \mid M_-) \bigr)$. One obviously expects that $Q(H) > 0$ when
$H$ is a persistent haplotype and $Q(H) < 0$ otherwise. Satisfyingly, we found
in this experiment that $Q(H)$ varied from 6.0 to 11.8 for persistent test
haplotypes and from -2.6 to -44.7 for nonpersistent test haplotypes.
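
The classification rule based on $Q(H)$ is a log-odds test between the two trained models. In the sketch below the two emission probabilities are taken as given inputs, and the function names and class labels are assumptions of this illustration.

```python
import math


def log_odds(p_plus, p_minus):
    """Q(H) = log10(P(H | M+) / P(H | M-)), computed as a difference of logs."""
    return math.log10(p_plus) - math.log10(p_minus)


def classify(p_plus, p_minus):
    """Label a haplotype by the sign of Q(H)."""
    return "persistent" if log_odds(p_plus, p_minus) > 0 else "nonpersistent"


print(log_odds(1e-3, 1e-9))
print(classify(1e-12, 1e-5))  # nonpersistent
```

Taking the difference of logarithms instead of the logarithm of the ratio avoids forming a very large or very small quotient when the two probabilities differ by many orders of magnitude, as they do in the reported experiment.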

5 Conclusion 

While our experimental evaluation of the methods suggests that useful results 
can be obtained in this way, many aspects still need further work. More exper- 
imental evaluation on generated and real data is necessary. We only used the 
simple models but also the general case should be considered. The approxima- 
tion algorithm for the MDL learning seems to work quite robustly but theoretical 
analysis of its performance is missing. Also faster variants of the learning algo- 
rithm may be necessary for larger data. It may also be useful to relax the model 
such that the fragments of the model are not required to cover the haplotypes 
entirely but small gaps between them are allowed.

AI systems must be able to learn, reason logically, and handle uncertainty. While
much research has focused on each of these goals individually, only recently 
have we begun to attempt to achieve all three at once. In this talk, I describe 
Markov logic, a representation that combines the full power of first-order logic 
and probabilistic graphical models, and algorithms for learning and inference in 
it. Syntactically, Markov logic is first-order logic augmented with a weight for 
each formula. Semantically, a set of Markov logic formulas represents a probabil- 
ity distribution over possible worlds, in the form of a Markov network with one 
feature per grounding of a formula in the set, with the corresponding weight. 
Formulas and weights are learned from relational databases using inductive logic 
programming and iterative optimization of a pseudo-likelihood measure.
Inference is performed by Markov chain Monte Carlo over the minimal subset of the
ground network required for answering the query. Experiments in a real-world 
university domain illustrate the promise of this approach. 

This work, joint with Matthew Richardson, is described in further detail in 
Domingos and Richardson [1] and Richardson and Domingos [2].

Abstract. In this paper we introduce a paradigm for learning in the
limit of potentially infinite languages from all positive data and nega- 
tive counterexamples provided in response to the conjectures made by 
the learner. Several variants of this paradigm are considered that re- 
flect different conditions/constraints on the type and size of negative 
counterexamples and on the time for obtaining them. In particular, we 
consider the models where 1) a learner gets the least negative counterex- 
ample; 2) the size of a negative counterexample must be bounded by 
the size of the positive data seen so far; 3) a counterexample may be 
delayed. Learning power, limitations of these models, relationships be- 
tween them, as well as their relationships with classical paradigms for 
learning languages in the limit (without negative counterexamples) are 
explored. Several surprising results are obtained. In particular, for Gold’s 
model of learning requiring a learner to syntactically stabilize on correct 
conjectures, learners getting negative counterexamples immediately turn 
out to be as powerful as the ones that do not get them for indefinitely 
(but finitely) long time (or are only told that their latest conjecture is 
not a subset of the target language, without any specific negative coun- 
terexample). Another result shows that for behaviourally correct learning 
(where semantic convergence is required from a learner) with negative 
counterexamples, a learner making just one error in almost all its
conjectures has the “ultimate power”: it can learn the class of all recursively
enumerable languages. Yet another result demonstrates that sometimes 
positive data and negative counterexamples provided by a teacher are 
not enough to compensate for full positive and negative data. 



1 Introduction 

Defining a computational model adequately describing learning languages is an 
important long-standing problem. In his classical paper [Gol67], M. Gold intro- 
duced two major computational models for learning languages. One of them, 
learning from texts, assumes that the learner receives all positive language data,
i.e., all correct statements of the language. The other model, learning from
informants, assumes that the learner receives all correct statements of the language,
as well as all other (incorrect) statements, appropriately labeled as incorrect, 
that can be potentially formed within the given alphabet. In both cases, a suc- 
cessful learner stabilizes at a correct description of the target language, i.e., a 
grammar for the target language. J. Barzdin [B74] and J. Case and C. Smith 
[CS83] introduced a different, more powerful model called behaviorally correct 
learning. A behaviorally correct learner almost always outputs conjectures (not 
necessarily the same) correctly describing the target language. An important 
feature of all these models is that they describe a process of learning in the 
limit: the learner stabilizes to the correct conjecture (or conjectures), but does 
not know when it happens. The above seminal models, doubtless, represent cer- 
tain important aspects of the process of learning potentially infinite targets. On 
the other hand, when we consider how a child learns a language communicating 
with a teacher, it becomes clear that these models reflect two extremes of this 
process: positive data only is certainly less than what a child actually gets in the 
learning process, while informant (the characteristic function of the language) 
is much more than what a learner can expect (see for example, [BH70,HPTS84, 
DPS86]). 

D. Angluin, in another seminal paper [Ang88], introduced a different impor- 
tant learning paradigm, i.e., learning from queries to a teacher (oracle). This 
model, explored in different contexts, including learning languages (see, for ex- 
ample, [LNZ02]), addresses a very important tool available to a child (or any 
other reasonable learner), i.e., queries to a teacher. However, in the context of 
learning languages, this model does not adequately reflect the fact that a learner, 
in the long process of acquisition of a new language, potentially gets access to 
all correct statements. (Exploration of computability via queries to oracles has a
long tradition in the theory of computation in general [Rog67,GM98], as well as
in the context of learning in the limit [GP89,FGJ+94,LNZ02]. Although in most
cases the queries are not algorithmically answerable - which is
the case in our model - or are computationally NP-hard or even harder - as in [Ang88] -
exploring computability or learnability via oracles often provides a deeper insight
into the nature and capabilities of both.)

In this paper, we combine learning from positive data and learning from 
queries into a computational model, where a learner gets all positive data and can 
ask a teacher if a current conjecture (a grammar) does not generate wrong state- 
ments (questions of this kind can be formalized as subset queries, cf. [Ang88]). 
If the conjecture does generate a wrong statement, then the teacher gives an 
example of such a statement (a negative counterexample) to the learner. In our 
main model, we assume that the teacher immediately provides a negative coun- 
terexample if it exists. However, in many situations, a teacher may obviously 
need a lot of time to determine if the current conjecture generates incorrect 
statements. Therefore, we consider two more variants of our main model that 
reflect this problem. In the first variant, the teacher is not able to provide a 
negative counterexample unless there is one whose size does not exceed the size
of the longest statement seen so far by the learner. In the second variant (not
considered in this version due to space restrictions), the teacher may delay pro- 
viding a negative counterexample. Interestingly, while the former model is shown 
to be weaker than the main model, the latter one turns out to be as powerful as 
the main model (in terms of capabilities of a learner; we do not discuss related 
complexity issues - such as how providing counterexamples quickly may speed 
up convergence to a right conjecture)! 
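
The interaction between learner and teacher in the basic model can be illustrated on a toy class of finite languages. This is only a simulation of the protocol under simplifying assumptions (finite sets as languages, a learner that conjectures the first listed language consistent with everything seen so far, and a teacher that happens to return the least counterexample); it is not a construction from the paper, and all names are hypothetical.

```python
def run_learner(languages, target, text):
    """Simulate learning from positive data and negative counterexamples.

    languages : list of finite sets, the hypothesis space in a fixed order
    target    : the language being learned (one of the sets above)
    text      : sequence of positive examples, i.e. elements of target
    Returns the list of conjectured indices, one per input element.
    """
    seen, negatives, conjectures = set(), set(), []
    for x in text:
        seen.add(x)
        # Conjecture the first language that contains all positive data seen
        # and none of the negative counterexamples received so far.
        guess = next(
            (i for i, lang in enumerate(languages)
             if seen <= lang and not (lang & negatives)),
            None,
        )
        conjectures.append(guess)
        if guess is not None:
            wrong = languages[guess] - target    # subset query to the teacher
            if wrong:
                negatives.add(min(wrong))        # teacher's counterexample
    return conjectures


languages = [{1, 2, 3}, {1}, {1, 2}]
print(run_learner(languages, {1, 2}, [1, 2, 1, 2, 1, 2]))  # [0, 2, 2, 2, 2, 2]
```

In this run the overgeneral first conjecture is refuted by the counterexample 3, the too-small language {1} is ruled out by the positive datum 2, and the learner then stabilizes on the index of the target.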

Our goal in this paper is to explore the new models of learning languages, 
their relationships, and how they fare in comparison with other popular learning
paradigms. In particular, we explore how the quality and availability of the
negative counterexamples to the conjectures affect learnability. Note that learning
from positive data and a finite amount of negative information was explored 
in [BCJ95]. However, unlike arbitrary negative counterexamples in our model, 
negative data in [BCJ95] is preselected to ensure that just a small number of 
negative examples (or, just one example) can greatly enhance capabilities of a 
learner. [Shi86] and [Mot92] also considered restricted models of learning from 
negative examples. 

The paper is structured as follows. In Section 2 we introduce necessary no- 
tation and basic definitions needed for the rest of the paper. In particular, we 
define some variants of the classical Gold’s model of learning from texts (positive 
data) and informants (both positive and negative data), TxtEx and InfEx, as
well as their behaviorally correct counterparts TxtBc and InfBc.

In Section 3 we define our three models for learning languages from texts 
and negative counterexamples. In the first, basic, model, a learner is provided 
a negative counterexample every time when it outputs a hypothesis containing 
elements not belonging to the target language. The second model is a variant 
of the basic model when a learner receives the least negative counterexample. 
The third model takes into account some complexity constraints. Namely, the 
learner receives a negative counterexample only if there exists one whose size is 
bounded by the size of the longest positive example seen in the input so far. (In 
the full version of the paper we also consider a model which slightly relaxes the
constraint of the third model: the size of the negative counterexample must be
bounded by the value of some function applied to the size of the longest positive 
example in the input.) 

Section 4 is devoted to Ex-style learning from positive data and negative
counterexamples: the learner eventually stabilizes to a correct grammar for the
target language. First, in order to demonstrate the power of our basic model, we
show that any indexed class of recursively enumerable languages can be learned
by a suitable learner in this model. Then we show that the second model is
equivalent to the basic model: providing the least negative counterexample does not
enhance the power of a learner. Our next major result (Theorem 4) is somewhat
surprising: we show that there is a class of languages learnable from informants
and not learnable in our basic model. This means that sometimes negative
counterexamples are not enough - the learner must have access to all statements not
belonging to the language! (This result follows from a more general result for
Bc-style learning in Section 5). In particular, this result establishes certain
constraints on the learning power of our basic model. The proof of this result employs
a new diagonalization technique working against machines learning via negative
counterexamples. We also establish a hierarchy of learning capabilities in our
basic model based on the number of errors that the learner is allowed to have in
the final hypothesis. Then we consider the model with restricted size of negative
counterexamples (described above). We show that this model is different from and
weaker than our basic model. Still we show that this model is quite powerful:
firstly, if restricted to the classes of infinite languages, it is equivalent to
the basic model, and, secondly, there are classes learnable in this model that
cannot be learned in the classical Bc-model (without negative counterexamples) -
even if an arbitrary finite number of errors is allowed in the correct conjectures.
Corollary 2 to the proof of Theorem 7 shows that there exists an indexed class
of recursively enumerable languages that cannot be learned with negative
counterexamples of restricted size (note that all such classes are learnable in our basic
Ex-style model as stated in Theorem 1).

Section 5 is devoted to our Bc-style models. As in the case of Ex-style
learning, we show that providing the least negative counterexample does not enhance
the power of a learner. Then we show that languages learnable in our basic
Bc-style model (without errors) are Bc-learnable from informants. In the end
we establish one of our most surprising results. First, we demonstrate that the
power of our basic Bc-style model is limited: there are classes of recursively
enumerable languages not learnable in this model if no errors in almost all
(correct) conjectures are allowed. On the other hand, there exists a learner that
can learn all recursively enumerable languages in this model with at most one
error in almost all correct conjectures! Based on similar ideas, we obtain some
other related results - in particular, that, with one error allowed in almost all
correct conjectures, the class of all infinite recursively enumerable languages is
learnable in the model with restricted size of negative counterexamples. To prove
these results, we developed a new type of learning algorithm, where a learner
makes a deliberate error in its conjectures to force a teacher to answer questions
not directly related to the input language. In contrast to the case with no
errors, we also show that when errors are allowed, Bc-learning from informants
is a proper subset of our basic model of Bc-learning with errors and negative
counterexamples (Corollary 4).

We also considered variants of the basic model when negative counterexam- 
ples are delayed or not provided at all (in the latter case the learner just gets 
the answer “no” if a current conjecture is not a subset of the input language; 
this model corresponds to learning with restricted subset queries [Ang88]). We 
showed that, surprisingly, both these variants are equivalent to the basic model! 
In other words, in the subset queries about the conjectures, the learner does 
not even need counterexamples - just the answers “yes” or “no” are enough (of 
course, the lack of counterexamples may delay convergence to the right
conjecture, but we do not address this issue). To prove these results, we developed and
employed for our model a variant of stabilizing/locking sequence [JORS99]. This
topic is handled in the full version of the paper.

2 Notation and Preliminaries 

Any unexplained recursion theoretic notation is from [Rog67]. The symbol $N$
denotes the set of natural numbers, $\{0, 1, 2, 3, \ldots\}$. Cardinality of a set $S$ is
denoted by card($S$). $L_1 \Delta L_2$ denotes the symmetric difference of $L_1$ and $L_2$,
that is $L_1 \Delta L_2 = (L_1 - L_2) \cup (L_2 - L_1)$. For a natural number $a$, we say that
$L_1 =^a L_2$, iff card($L_1 \Delta L_2$) $\le a$. We say that $L_1 =^* L_2$, iff card($L_1 \Delta L_2$) $< \infty$.
If $L_1 =^a L_2$, then we say that $L_1$ is an $a$-variant of $L_2$.
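As a small illustration of the $a$-variant relation, the following Python sketch checks it on finite sets. The function names are hypothetical, the languages are restricted to finite sets for the sake of the example, and the bound is assumed to be inclusive (at most $a$ differing elements).

```python
def symmetric_difference(l1, l2):
    """Return (l1 - l2) U (l2 - l1) for two finite sets."""
    return (l1 - l2) | (l2 - l1)


def is_a_variant(l1, l2, a):
    """Check l1 =^a l2 for finite sets.

    `a` is a natural number or the string "*". For finite sets the
    symmetric difference is always finite, so "*" always holds.
    """
    if a == "*":
        return True
    return len(symmetric_difference(l1, l2)) <= a


# {1, 2, 3} and {2, 3, 4} differ on exactly the elements 1 and 4.
print(is_a_variant({1, 2, 3}, {2, 3, 4}, 2))
print(is_a_variant({1, 2, 3}, {2, 3, 4}, 1))
```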

We let $\{W_i\}_{i \in N}$ denote an acceptable numbering of all r.e. sets. Symbol $\mathcal{E}$
will denote the set of all r.e. languages. Symbol $L$, with or without decorations,
ranges over $\mathcal{E}$. By $\chi_L$ we denote the characteristic function of $L$. By $\overline{L}$, we denote
the complement of $L$, that is $N - L$. Symbol $\mathcal{L}$, with or without decorations,
ranges over subsets of $\mathcal{E}$. By $W_{i,s}$ we denote the set $W_i$ enumerated within $s$
steps, in some standard method of enumerating $W_i$.

We now present concepts from language learning theory. A text (see [Gol67])
is a mapping from $N$ to $N \cup \{\#\}$. We let $T$ range over texts. content($T$) is
defined to be the set of natural numbers in the range of $T$ (i.e. content($T$) =
range($T$) $- \{\#\}$). $T$ is a text for $L$ iff content($T$) $= L$. That means a text
for $L$ is an infinite sequence whose range, except for a possible #, is just $L$.
Intuitively, $T(i)$ represents the $(i+1)$-th element in a presentation of $L$, with #'s
representing pauses in the presentation. $T[n]$ denotes the finite initial sequence,
$T(0), T(1), \ldots, T(n-1)$, of $T$ of length $n$. A finite sequence is an initial sequence
of a text (i.e., it is a mapping from an initial segment of $N$ into $N \cup \{\#\}$). The
content of a finite sequence $\sigma$, denoted content($\sigma$), is the set of natural numbers
in the range of $\sigma$. The empty sequence is denoted by $\Lambda$. The length of $\sigma$, denoted
by $|\sigma|$, is the number of elements in $\sigma$. So, $|\Lambda| = 0$. For $n \le |\sigma|$, the initial
sequence of $\sigma$ of length $n$ is denoted by $\sigma[n]$. We let $\sigma$, $\tau$, and $\gamma$ range over finite
sequences. We denote the sequence formed by the concatenation of $\tau$ at the end
of $\sigma$ by $\sigma\tau$. SEQ denotes the set of all finite sequences.
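The notions of text, content and initial segment can be sketched in Python as follows. This is only an illustration: a text is modelled as a Python function from natural numbers to numbers or the pause symbol, and the names `PAUSE`, `content`, `initial_segment` and `even_text` are hypothetical.

```python
PAUSE = "#"


def content(seq):
    """Set of natural numbers occurring in a finite sequence (pauses dropped)."""
    return {x for x in seq if x != PAUSE}


def initial_segment(text, n):
    """T[n]: the finite sequence T(0), T(1), ..., T(n-1)."""
    return [text(i) for i in range(n)]


def even_text(i):
    """A text for the even numbers, with a pause at every odd position."""
    return i if i % 2 == 0 else PAUSE


segment = initial_segment(even_text, 5)
print(segment)           # the first five elements of the text
print(content(segment))  # the positive data seen so far
```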

We say that a recursive function $I$ is an informant for $L$ iff for all $x$, $I(x) =
(x, \chi_L(x))$. Intuitively, informants give both all positive and all negative data on
the language being learned. $I[n]$ is the first $n$ elements of the informant $I$.

A language learning machine (from texts) is a computable mapping from SEQ
into $N$. One can similarly define language learning machines from informants.
We let $M$, with or without decorations, range over learning machines. $M(T[n])$
(or $M(I[n])$) is interpreted as the grammar (index for an accepting program)
conjectured by the learning machine $M$ on the initial sequence $T[n]$ (or $I[n]$). We
say that $M$ converges on $T$ to $i$, (written: $M(T)\downarrow = i$) iff $(\forall^\infty n)[M(T[n]) = i]$.
Convergence on informants is similarly defined.

There are several criteria for a learning machine to be successful on a
language. Below we define some of them. All of the criteria defined below are variants
of the Ex-style and Bc-style learning described in the Introduction; in addition,
they allow a finite number of errors in almost all conjectures (uniformly bounded,
or arbitrary).

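Definition 1. [Gol67,CL82] Suppose $a \in N \cup \{*\}$.

(a) $M$ TxtEx$^a$-identifies an r.e. language $L$ (written: $L \in$ TxtEx$^a(M)$) just
in case for all texts $T$ for $L$, $(\exists i \mid W_i =^a \text{content}(T))(\forall^\infty n)[M(T[n]) = i]$.

(b) $M$ TxtEx$^a$-identifies a class $\mathcal{L}$ of r.e. languages (written: $\mathcal{L} \subseteq$
TxtEx$^a(M)$) just in case $M$ TxtEx$^a$-identifies each language from $\mathcal{L}$.

(c) TxtEx$^a$ $= \{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[\mathcal{L} \subseteq \text{TxtEx}^a(M)]\}$.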

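Definition 2. [CL82] Suppose $a \in N \cup \{*\}$.

(a) $M$ TxtBc$^a$-identifies an r.e. language $L$ (written: $L \in$ TxtBc$^a(M)$) just
in case for all texts $T$ for $L$, $(\forall^\infty n)[W_{M(T[n])} =^a L]$.

(b) $M$ TxtBc$^a$-identifies a class $\mathcal{L}$ of r.e. languages (written: $\mathcal{L} \subseteq$
TxtBc$^a(M)$) just in case $M$ TxtBc$^a$-identifies each language from $\mathcal{L}$.

(c) TxtBc$^a$ $= \{\mathcal{L} \subseteq \mathcal{E} \mid (\exists M)[\mathcal{L} \subseteq \text{TxtBc}^a(M)]\}$.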

One can similarly define learning from informants, resulting in criteria
InfEx$^a$ and InfBc$^a$. We refer the reader to [Gol67,CL82,JORS99] for details.

For $a = 0$, we often write TxtEx, TxtBc, InfEx, InfBc instead of
TxtEx$^0$, TxtBc$^0$, InfEx$^0$, InfBc$^0$, respectively.

$\mathcal{L}$ is said to be an indexed family of languages iff there exists an indexing
$L_0, L_1, \ldots$ of languages in $\mathcal{L}$ such that the question $x \in L_i$ is uniformly decidable
(i.e., there exists a recursive function $f$ such that $f(i, x) = \chi_{L_i}(x)$).
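To make the definition concrete, here is a minimal Python sketch of one indexed family together with its uniform decision procedure. The family of initial segments $L_i = \{0, \ldots, i\}$ is chosen purely for illustration and is not taken from the paper; the helper `informant` (a hypothetical name) shows how the same decision procedure yields an initial part of an informant for $L_i$.

```python
def f(i, x):
    """Uniform decision procedure for the family L_i = {0, ..., i}.

    Returns 1 if x belongs to L_i and 0 otherwise, i.e. the value of the
    characteristic function of L_i at x.
    """
    return 1 if x <= i else 0


def informant(i, n):
    """First n elements of the informant for L_i: pairs (x, f(i, x))."""
    return [(x, f(i, x)) for x in range(n)]


print(informant(2, 5))
```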

3 Learning with Negative Counterexamples

In this section we define three models of learning languages from positive data 
and negative counterexamples. Intuitively, for learning with negative counterex- 
amples, we may consider the learner being provided a text, one element at a 
time, along with a negative counterexample to the latest conjecture, if any. (One 
may view this negative counterexample as a response of the teacher to the subset
query when it is tested if the language generated by the conjecture is a subset
of the target language). One may model the list of negative counterexamples as
a second text for negative counterexamples being provided to the learner. Thus
the learning machines get as input two texts, one for positive data, and the other
for negative counterexamples. We say that $M(T, T')$ converges to a grammar $i$,
iff for all but finitely many $n$, $M(T[n], T'[n]) = i$.

First, we define the basic model of learning from positive data and negative 
counterexamples. In this model, if a conjecture contains elements not in the 
target language, then a negative counterexample is provided to the learner. NC 
in the definition below stands for negative counterexample. 

Definition 3. Suppose $a \in N \cup \{*\}$.

(a) $M$ NCEx$^a$-identifies a language $L$ (written: $L \in$ NCEx$^a(M)$) iff for all
texts $T$ for $L$, and for all $T'$ satisfying the condition:

$T'(n) \in S_n$, if $S_n \neq \emptyset$ and $T'(n) = \#$, if $S_n = \emptyset$,
where $S_n = \overline{L} \cap W_{M(T[n], T'[n])}$,

$M(T, T')$ converges to a grammar $i$ such that $W_i =^a L$.

(b) $M$ NCEx$^a$-identifies a class $\mathcal{L}$ of languages (written: $\mathcal{L} \subseteq$ NCEx$^a(M)$),
iff $M$ NCEx$^a$-identifies each language in the class.

(c) NCEx$^a$ $= \{\mathcal{L} \mid (\exists M)[\mathcal{L} \subseteq \text{NCEx}^a(M)]\}$.

We also define two variants of the above definition as follows:

— the learner gets the least negative counterexample instead of any
counterexample. This criterion is denoted LNCEx$^a$.

— the negative counterexample is provided only if there exists one such
counterexample $\le$ the maximum positive element seen in the input so far (otherwise
the learner gets #). This criterion is denoted by BNCEx$^a$. (Essentially, $S_n$ in the
definition of $T'(n)$ in part (a) is replaced by $S_n = \overline{L} \cap W_{M(T[n], T'[n])} \cap \{x \mid x \le
\max(\text{content}(T[n]))\}$).
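The three kinds of teacher response can be sketched on finite sets as follows. This is an illustration only: the function names are hypothetical, languages and conjectures are finite Python sets, the size bound in the third model is assumed to be inclusive, and it is assumed that no counterexample is given before any positive element has been seen.

```python
PAUSE = "#"


def content(seq):
    """Natural numbers occurring in a finite sequence (pauses dropped)."""
    return {x for x in seq if x != PAUSE}


def nc_teacher(target, conjecture):
    """Basic model: return some element of conjecture - target, if any.

    Any such element is allowed; the largest is picked here only to
    stress that it need not be the least one.
    """
    wrong = conjecture - target
    return max(wrong) if wrong else PAUSE


def lnc_teacher(target, conjecture):
    """Least-counterexample model: return the least element of conjecture - target."""
    wrong = conjecture - target
    return min(wrong) if wrong else PAUSE


def bnc_teacher(target, conjecture, seen):
    """Bounded model: only counterexamples up to the largest positive element seen."""
    positives = content(seen)
    if not positives:
        return PAUSE  # assumption: nothing seen yet, so no counterexample
    bound = max(positives)
    wrong = {x for x in conjecture - target if x <= bound}
    return min(wrong) if wrong else PAUSE  # any element of `wrong` would do


target = {0, 2, 4}
conjecture = {0, 1, 2, 7}
print(nc_teacher(target, conjecture))
print(lnc_teacher(target, conjecture))
print(bnc_teacher(target, conjecture, [0, PAUSE, 2]))
print(bnc_teacher(target, conjecture, [0]))
```

In the last call the only positive element seen is 0, so the counterexamples 1 and 7 both exceed the bound and the learner receives a pause.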

Similarly, we can define NCBc$^a$, LNCBc$^a$ and BNCBc$^a$ criteria of inference.

The BNC model essentially addresses some complexity constraints. In the
full paper, we also consider a generalization of the BNC-model above where we
consider negative counterexamples within a recursive factor $h$ of the maximum
positive element seen in the input so far.

It is easy to see that TxtEx$^a$ $\subseteq$ BNCEx$^a$ $\subseteq$ NCEx$^a$ $\subseteq$ LNCEx$^a$. All of
these containments will be shown to be proper, except the last one.

4 Ex-type Learning with Negative Counterexamples 

We first show that every indexed family can be learned using positive data and 
negative counterexamples. This improves a classical result that every indexed 
family is learnable from informants. Since there exist indexed families not in 
TxtEx, this illustrates a difference between NCEx learning and learning with- 
out negative counterexamples. 

Theorem 1. Suppose $\mathcal{L}$ is an indexed family. Then $\mathcal{L} \in$ NCEx.

We now illustrate another difference between NCEx learning and TxtEx
learning. The following result does not hold for TxtEx [Gol67] (for example, $\{F \mid F$
is finite$\} \cup \{L\} \notin$ TxtEx, for any infinite language $L$).

Theorem 2. Suppose $\mathcal{L} \in$ NCEx and $L$ is a recursive language. Then $\mathcal{L} \cup \{L\} \in$
NCEx.

The above result however doesn't generalize to taking an r.e. language (instead of
a recursive language) $L$, as witnessed by $\mathcal{L} = \{A \cup \{x\} \mid x \notin A\}$, and $L = A$,
where $A$ is any non-recursive r.e. set. Here note that $\mathcal{L} \in$ TxtEx, but $\mathcal{L} \cup \{L\}$
is not in NCEx.

The following theorem shows that using least negative counterexamples,
rather than arbitrary negative counterexamples, does not enhance the power of a
learner - this is applicable also in the case when a learner can make a finite bounded
number of mistakes in the final conjecture.

Theorem 3. Suppose $a \in N \cup \{*\}$. Then,

(a) NCEx$^a$ $=$ LNCEx$^a$.

(b) LNCEx$^a$ $\subseteq$ InfEx$^a$.

The next result is somewhat surprising. It shows that sometimes negative 
counterexamples are not enough: to learn a language, the learner must have 
access to all negative examples. (In particular, it demonstrates a limitation on
the learning power of our basic model).

Theorem 4. InfEx $-$ NCEx$^*$ $\neq \emptyset$.

We now show the error hierarchy for NCEx-learning. That is, learning with 
at most n + 1 errors in almost all conjectures in our basic model is stronger than 
learning with at most n errors. The hierarchy easily follows from the following 
theorem. 

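Theorem 5. Suppose $n \in N$.

(a) TxtEx$^{n+1}$ $-$ NCEx$^n$ $\neq \emptyset$.

(b) TxtEx$^*$ $- \bigcup_{n \in N}$ NCEx$^n$ $\neq \emptyset$.

As TxtEx$^{n+1}$ $\subseteq$ BNCEx$^{n+1}$ $\subseteq$ NCEx$^{n+1}$ $\subseteq$ LNCEx$^{n+1}$, the following
corollary follows from Theorem 5.

Corollary 1. Suppose $n \in N$. Then, for $I \in \{$NCEx, LNCEx, BNCEx$\}$, we
have $I^n \subset I^{n+1}$.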

Now we demonstrate yet another limitation on the learning power of our 
basic model when an arbitrary finite number of errors is allowed in the final 
conjecture: there are languages learnable within the classical Bc-style model 
(without negative counterexamples) and not learnable in the above variant of 
our basic model. 

Theorem 6. TxtBc $-$ NCEx$^*$ $\neq \emptyset$.

We now turn to the model where the size of negative counterexamples is
restricted: BNCEx. We first show that there are classes of languages learnable in
our basic model that cannot be learned in any of the models that use negative
counterexamples of limited size.

Theorem 7. NCEx $-$ BNCBc$^*$ $\neq \emptyset$.

However, the following theorem shows that if attention is restricted to only 
infinite languages, then NCEx and BNCEx behave similarly. 

Theorem 8. Suppose $\mathcal{L}$ consists of only infinite languages. Then $\mathcal{L} \in$ NCEx$^a$
iff $\mathcal{L} \in$ BNCEx$^a$.

Proof. (sketch) As BNCEx$^a$ $\subseteq$ NCEx$^a$, it suffices to show that if $\mathcal{L} \in$ NCEx$^a$
then $\mathcal{L} \in$ BNCEx$^a$. Suppose $M$ NCEx$^a$-identifies $\mathcal{L}$. Define $M'$ as follows. $M'$
on the input text $T$ of positive data for an infinite language $L$ behaves as follows.
Initially let Cntrexmpls $= \emptyset$. Intuitively, Cntrexmpls denotes the set of negative
counterexamples received so far. Initially let NegSet $= \emptyset$. Intuitively, NegSet
denotes the set of grammars for which we know a negative counterexample. For
$j \in$ NegSet, ncex($j$) would denote a negative counterexample for $j$. For ease of
presentation, we will let $M'$ output more than one conjecture (one after another)
at some input point and get negative counterexamples for each of them. This is
for ease of presentation and one can always spread out the conjectures.

Stage $s$ (on input $T[s]$)

1. Simulate $M$ on $T[s]$, by giving negative counterexamples to any conjectures
   $j \in$ NegSet by ncex($j$). Other grammars get # as counterexample.

2. Let $S$ be the set of conjectures output by $M$, in the above simulation, on
   initial segments of $T[s]$, and let $k$ be the final conjecture.

3. If $k \notin$ NegSet, output a grammar for $\bigcup_{i \in S - \text{NegSet}} W_i$.
   Otherwise (i.e., if $k \in$ NegSet), output a grammar for $(W_k - \text{Cntrexmpls}) \cup
   \bigcup_{i \in S - \text{NegSet}} W_i$.

4. If there is no negative counterexample, then go to stage $s + 1$.

5. Else (there is a negative counterexample), output one by one, for each $i \in
   S - \text{NegSet}$, grammar $i$. If a negative counterexample is obtained, then
   place $i$ in NegSet and define ncex($i$) to be this negative counterexample.
   (Note that since $M'$ is for BNCEx-type learning, negative examples received
   would be $\le \max(\text{content}(T[s]))$, if any).
   Update Cntrexmpls based on new negative counterexamples obtained.

6. Go to stage $s + 1$.

End Stage $s$.

Suppose $T$ is a text for an infinite language $L \in \mathcal{L}$. Then, based on conjectures
made at steps 3 and 5, and the negative counterexamples received, one can argue
that for all but finitely many stages, answers given in step 1 to machine $M$
are correct. (Since, one would eventually place any conjecture of $M$ on text $T$,
which enumerates a non-subset of $L$, in NegSet due to steps 3 and 5). Thus,
for all but finitely many stages, the grammar output in step 3 would be correct
(except for possibly $a$ errors of omission by the final grammar of $M$), and $M'$
BNCEx$^a$-identifies $L$. □

Our next result shows that the model BNCEx, while being weaker than our 
basic model, is still quite powerful: there are classes of languages learnable in 
this model that cannot be learned in the classical Bc-style model even when an 
arbitrary finite number of errors is allowed in almost all conjectures. 

Theorem 9. BNCEx $-$ TxtBc$^*$ $\neq \emptyset$.

Note that the diagonalizations in Theorems 7 and 9 can be shown using 
indexed families of languages. Thus, in contrast to Theorem 1, we get 

Corollary 2. There exists an indexed family not in BNCBc$^*$.

5 Bc-type Learning with Negative Counterexamples

In this section we explore Bc-style learning from positive data and negative 
counterexamples. First we show that, for Bc-style learning, similarly to Ex- 
style learning, our basic model is equivalent to learning with the least negative 
counterexamples. 

Proposition 1. (a) NCBc = LNCBc. 

(b) LNCBc $\subseteq$ InfBc.

Our next result shows that, to learn a language, sometimes even for Bc- 
style learning, positive data and negative counterexamples are not enough - the 
learner must have access to all negative data. In particular, limitations on the 
learning power of our basic Bc-style model are established. 

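Theorem 10. InfEx $-$ NCBc $\neq \emptyset$.

Proof. Let $\mathcal{L} = \{L \mid (\exists e)[\min(L) = 2e] \wedge$

(i) $[L = W_e$ and $(\forall x > e)[L \cap \{2x, 2x + 1\} \neq \emptyset]]$

OR

(ii) $(\exists x > e)[L \cap \{2x, 2x + 1\} = \emptyset \wedge (\forall y > 2x + 1)[y \in L]]$

$\}$.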

It is easy to verify that $\mathcal{L} \in$ InfEx. A learner can easily find $e$ as above,
and whether there exists $x > e$ such that both $2x$, $2x + 1$ are not in the input
language. This information is sufficient to identify the input language.

We now show that $\mathcal{L} \notin$ NCBc. Intuitively, the idea is that a learner which
learns a language satisfying clause (ii) above, must output infinitely many
grammars properly extending the input seen up to the point of conjecture. By
repeatedly searching for such input and conjectures, one can identify one element of
each such conjecture as negative counterexample, allowing one to construct $W_e$
as needed for clause (i) as well as diagonalizing out of NCBc. Note that we
needed a pair $\{2x, 2x + 1\}$, to separate (i) from (ii) as one of the elements may
be needed for giving the negative counterexamples as mentioned above. We now
proceed formally.

Suppose by way of contradiction that machine $M$ NCBc-identifies $\mathcal{L}$. Then
by the Kleene Recursion Theorem [Rog67] there exists a recursive function $e$
such that $W_e$ may be defined as follows.

Initially, let $W_e = \{2e, 2e + 1\}$ and $\sigma_0$ be such that content($\sigma_0$) $= \{2e, 2e + 1\}$.
Intuitively Cntrexmpls denotes the set of elements frozen to be outside the
diagonalizing language being constructed. Initially, Cntrexmpls $= \{x \mid x < 2e\}$.
Intuitively, NegSet is the set of conjectured grammars for which we have found
a negative counterexample (in Cntrexmpls). Initially let NegSet $= \emptyset$. ncex($j$)
is a function which gives, for $j \in$ NegSet, a negative counterexample from
Cntrexmpls. For the following, let $\gamma_\tau$ be a sequence of length $|\tau|$ defined as
follows. For $i < |\tau|$,

$$\gamma_\tau(i) = \begin{cases} \text{ncex}(M(\tau[i], \gamma_\tau[i])), & \text{if } M(\tau[i], \gamma_\tau[i]) \in \text{NegSet}; \\ \#, & \text{otherwise.} \end{cases}$$

(where the value of NegSet is as at the time of above usage).

Let $x_0 = 2e + 2$. Intuitively, $x_s$ is the least even element greater than
max(content($\sigma_s$) $\cup$ Cntrexmpls). Also we will have the invariant that at the start
of stage $s$,

(i) every element $< x_s$ is either in content($\sigma_s$) or Cntrexmpls and

(ii) content($\sigma_s$) consists of elements enumerated in $W_e$ before stage $s$.

Go to stage 0.

Stage $s$

1. Dovetail steps 2 and 3 until step 2 or 3 succeeds. If step 2 succeeds before
   step 3, if ever, then go to step 4. If step 3 succeeds before step 2, if ever,
   then go to step 5.
   Here we assume that if step 3 can succeed by simulating $M(\tau, \gamma_\tau)$ for $s$
   steps, then step 3 succeeded first (and for the shortest such $\tau$), otherwise
   whichever of these steps succeeds first is taken. (So some priority is given
   to step 3 in the dovetailing).

2. Search for a $\tau \supseteq \sigma_s$ such that
   content($\tau$) $\subseteq$ content($\sigma_s$) $\cup \{x \mid x \ge x_s + 2\}$,
   $M(\tau, \gamma_\tau) \notin$ NegSet and
   $W_{M(\tau, \gamma_\tau)}$ enumerates an element not in content($\tau$).

3. Search for a $\tau \subseteq \sigma_s$ such that $M(\tau, \gamma_\tau) \notin$ NegSet and $W_{M(\tau, \gamma_\tau)}$ enumerates
   an element not in content($\sigma_s$).

4. Let $\tau$ be as found in step 2, and $j = M(\tau, \gamma_\tau)$, and $z$ be the element found
   to be enumerated by $W_j$ which is not in content($\tau$).
   Let NegSet $=$ NegSet $\cup \{j\}$.
   Let Cntrexmpls $=$ Cntrexmpls $\cup \{z\}$.
   Let ncex($j$) $= z$.
   Let $x_{s+1}$ be the least even number $> \max(\text{content}(\tau) \cup \{x_s, z\})$.
   Enumerate $\{x \mid x_s \le x < x_{s+1}\} - \{z\}$ in $W_e$.
   Let $\sigma_{s+1}$ be an extension of $\tau$ such that content($\sigma_{s+1}$) $= W_e$ enumerated
   until now.
   Go to stage $s + 1$.

5. Let $\tau$ be as found in step 3, and $j = M(\tau, \gamma_\tau)$, and $z$ be the element found
   to be enumerated by $W_j$ which is not in content($\sigma_s$).
   Let NegSet $=$ NegSet $\cup \{j\}$.
   Let Cntrexmpls $=$ Cntrexmpls $\cup \{z\}$.
   Let ncex($j$) $= z$.
   Let $x_{s+1}$ be the least even number $> \max(\{x_s, z\})$.
   Enumerate $\{x \mid x_s \le x < x_{s+1}\} - \{z\}$ in $W_e$.
   Let $\sigma_{s+1}$ be an extension of $\sigma_s$ such that content($\sigma_{s+1}$) $= W_e$ enumerated
   until now.
   Go to stage $s + 1$.

End stage $s$

We now consider the following cases:

Case 1: Stage $s$ starts but does not finish.

In this case let $L = W_e \cup \{x \mid x \ge x_s + 2\}$. Note that, due to non-success
of steps 2 and 3, the negative information given in computation of $\gamma_\tau$ based
on NegSet is correct. Thus, for any text $T$ for $L$ extending $\sigma_s$, for $n > |\sigma_s|$,
$M(T[n], \gamma_{T[n]}) \in$ NegSet or it enumerates only a finite set (otherwise step 2
would succeed). Thus, $M$ does not NCBc-identify $L$.

Case 2: All stages finish.

Let $L = W_e$. Let $T = \bigcup_{s \in N} \sigma_s$. Note that $T$ is a text for $L$. Let Cntrexmpls
denote the set of all elements which are ever placed in Cntrexmpls by the above
construction. Note that eventually, any conjecture $j$ by $M$ on $(T, \gamma_T)$, which
enumerates an element not in $L$, belongs to NegSet, with a negative counterexample
for it belonging to Cntrexmpls (given by ncex($j$)). This is due to eventual
success of step 3, for all $\tau \subseteq T$, for which $W_{M(\tau, \gamma_\tau)} \not\subseteq L$ (due to priority assigned to
step 3).

If there are infinitely many $\tau \subseteq T$ such that $W_{M(\tau, \gamma_\tau)} \not\subseteq L$, then clearly, $M$
does not NCBc-identify $L$. On the other hand, if there are only finitely many
such $\tau$, then clearly all such $\tau$ would have been handled by some stage $s$, and
beyond stage $s$, step 3 would never succeed. Thus, beyond stage $s$ computation of
$M(\tau, \gamma_\tau)$, as at stage $s$ step 2, is always correct (with negative counterexamples
given, whenever necessary), and step 2 succeeds infinitely often. Thus again
infinitely many conjectures of $M$ on $(T, \gamma_T)$ are incorrect (and enumerate an
element of $\overline{L}$), contradicting the hypothesis.

From the above cases it follows that $M$ does not NCBc-identify $\mathcal{L}$. The theorem
follows. □

Our next result shows that all classes of languages learnable in our basic 
Ex-style model with arbitrary finite number of errors in almost all conjectures 
can be learned without errors in the basic Bc-style model. Note the contrast 
with learning from texts where TxtEx$^{2n+1}$ $-$ TxtBc$^n$ $\neq \emptyset$ [CL82].

Theorem 11. NCEx$^*$ $\subseteq$ NCBc.

The next theorem establishes yet another limitation on the learning power of
our basic Bc-style learning: some languages not learnable in this model can
be Bc-learned without negative counterexamples if only one error in almost all
conjectures is allowed.

Theorem 12. TxtBc$^1$ $-$ NCBc $\neq \emptyset$.

Now we turn to Bc-style learning with limited size of negative
counterexamples. First, note that Theorem 9 gives us: BNCEx $-$ TxtBc$^*$ $\neq \emptyset$. In other
words, some languages Ex-learnable with negative counterexamples of limited
size cannot be Bc-learned without negative counterexamples even with an
arbitrary finite number of errors in almost all conjectures.

On the other hand, as Theorem 7 shows, some languages learnable in our 
basic Ex-style learning with negative counterexamples cannot be learned in Bc- 
model with limited size of negative counterexamples even if an arbitrary finite 
number of errors is allowed in almost all conjectures.

Now we establish one of our most surprising results: there exists a Bc-style
learner with negative counterexamples, allowing just one error in almost all
conjectures, with the “ultimate power” - it can learn the class of all recursively
enumerable languages!

Theorem 13. $\mathcal{E} \in$ NCBc$^1$.

Proof. First we give an informal idea of the proof. Our learner can clearly test if
a particular $W_s \subseteq L$. Given an arbitrary initial segment of the input $T[n]$, we will
want to test if content($T[n]$) $\subseteq W_s$ for any r.e. set $W_s \subseteq L$, where $L$ is a target
language. Of course, the teacher cannot directly answer such questions, since $W_s$
might not be the target language (note also that the problem is undecidable).
However, the learner finds a way to encode this problem into a current conjecture
and test if the current conjecture generates a subset of the target language.
In order to do this, the learner potentially makes one deliberate error in its
conjecture! We now proceed formally.

Define $M$ on the input text $T$ as follows. Initially, it outputs a grammar
for $N$. If it does not generate a negative counterexample, then we are done.
Otherwise, let $c$ be the negative counterexample. Go to stage 0.

Stage $s$

1. Output grammar $s$. If it generates a negative counterexample, then go to
   stage $s + 1$.

2. Else,
   For $n = 0$ to $\infty$ do:
      Output a grammar for language $X$, where $X$ is defined as follows:

      $$X = \begin{cases} \emptyset, & \text{if content}(T[n]) \not\subseteq W_s; \\ W_s \cup \{c\}, & \text{otherwise.} \end{cases}$$

      If it does not generate a negative counterexample, then go to stage
      $s + 1$,
      Otherwise continue with the next iteration of For loop.
   EndFor

End stage $s$

We now claim that the above $M$ NCBc$^1$-identifies $\mathcal{E}$. Clearly, if $L = N$, then $M$
NCBc$^1$-identifies $L$. Now suppose $L \neq N$. Let $c$ be the negative counterexample
received by $M$ for $N$. Let $j$ be the least grammar for $L$, and $T$ be a text for $L$.
We claim that all stages $s < j$ will finish, and stage $j$ will not finish. To see this
consider any $s < j$.

Case 1: $W_s \not\subseteq L$.

In this case note that step 1 would generate a negative counterexample, and
thus we will go to stage $s + 1$.

Case 2: Not Case 1 (i.e., $W_s \subseteq L$ but $L \not\subseteq W_s$).

In this case, let $m$ be least such that content($T[m]$) $\not\subseteq W_s$. Then, in the
iteration of For loop in step 2, with $n = m$, the grammar output is for $\emptyset$. Thus,
there is no negative counterexample, and the algorithm proceeds to stage $s + 1$.

Also, note that in stage $s = j$, step 1 would not get a negative
counterexample, and since $c \notin L$, every iteration of For loop will get a negative
counterexample. Thus, $M$ keeps outputting a grammar for $W_j \cup \{c\}$. Hence $M$ NCBc$^1$-identifies
$L$. Thus, we have that $M$ NCBc$^1$-identifies $\mathcal{E}$. □
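The behaviour of the learner from the proof above can be simulated in a finite toy setting. The sketch below is an illustration under strong simplifying assumptions, not the construction itself: the numbering $W_0, W_1, \ldots$ is replaced by a finite list of finite sets, $N$ by a finite universe, the text by a finite list without pauses that enumerates the whole target, and the infinite For loop by a loop over all initial segments of that list. All names are hypothetical.

```python
def learn_with_one_error(grammars, universe, target, text):
    """Toy simulation of the one-error learner on finite sets.

    grammars: list of finite sets standing in for W_0, W_1, ...
    universe: finite set standing in for N
    target:   the language L to be learned (a subset of universe)
    text:     finite list of elements of L that enumerates all of L
    Returns the conjecture on which the learner settles.
    """
    def counterexample(conjecture):
        # Teacher: least element of the conjecture outside the target, if any.
        wrong = conjecture - target
        return min(wrong) if wrong else None

    c = counterexample(universe)
    if c is None:
        return set(universe)  # the target is the whole universe

    for w in grammars:  # stage s works with w = W_s
        if counterexample(w) is not None:
            continue  # step 1: W_s is not a subset of L, go to next stage
        stage_finished = False
        for n in range(len(text) + 1):  # step 2, over initial segments T[n]
            seen = set(text[:n])
            x = set() if not seen <= w else w | {c}
            if counterexample(x) is None:
                stage_finished = True  # X was empty: W_s misses some datum
                break
        if not stage_finished:
            return w | {c}  # the learner keeps conjecturing W_s U {c}
    raise ValueError("no set in `grammars` equals the target")


universe = set(range(6))
target = {1, 3}
grammars = [{0, 1}, {1}, {1, 3, 4}, {1, 3}, {3}]
result = learn_with_one_error(grammars, universe, target, [1, 3, 1])
print(result, len(result ^ target))
```

In this run the learner settles at the fourth set in the list and outputs it with the extra element $c = 0$, so the final conjecture differs from the target on exactly one element, which is the deliberate error described in the proof.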

Since $\mathcal{E} \in$ InfBc$^*$, we have

Corollary 3. (a) NCBc$^1$ $=$ InfBc$^*$.

(b) For all $a \in N \cup \{*\}$, NCBc$^a$ $=$ LNCBc$^a$.

The following corollary shows a contrast with respect to the case when there
are no errors in conjectures (Proposition 1 and Theorem 10). What a difference
just one error can make! Using the fact that InfBc$^n$ $\subset$ InfBc$^*$ (see [CS83]), we
get

Corollary 4. For all $n > 0$, InfBc$^n$ $\subset$ NCBc$^n$ $=$ NCBc$^1$.

The ideas of the above theorem are employed to show that all infinite
recursively enumerable languages can be learned in our basic Bc-style model with
negative counterexamples of limited size allowing just one error in almost all
conjectures. Note that, as we demonstrated in Corollary 2, contrary to the case
when there are no limits on the size of negative counterexamples, such learners
cannot learn the class of all recursively enumerable languages.

Theorem 14. Let $\mathcal{L} = \{L \in \mathcal{E} \mid L$ is infinite$\}$. Then $\mathcal{L} \in$ BNCBc$^1$.

As there exists a class of infinite languages which does not belong to InfBc$^n$
(see [CS83]), we have

Corollary 5. For all $n \in N$, BNCBc$^1$ $-$ InfBc$^n$ $\neq \emptyset$.

Thus, BNCBc$^m$ and InfBc$^n$ are incomparable for $m > 0$. The above result
does not generalize to InfBc$^*$, as InfBc$^*$ contains the class $\mathcal{E}$.

Based on ideas similar to the ones used in Theorem 13, we can show
that all classes of languages Bc$^n$-learnable without negative counterexamples
can be Bc-learned with negative counterexamples of limited size when one error
in almost all conjectures is allowed.

Theorem 15. (a) For all $n \in N$, TxtBc$^n$ $\subseteq$ BNCBc$^1$.

(b) TxtEx$^*$ $\subseteq$ BNCBc$^1$.

Similarly to the case of Ex-style learning, BNCBc and NCBc models turn 
out to be equivalent for the classes of infinite languages. 

Theorem 16. Suppose $\mathcal{L}$ consists of only infinite languages. Then $\mathcal{L} \in$ NCBc
iff $\mathcal{L} \in$ BNCBc.

We now mention some of the open questions regarding behaviourally correct 
learning when the size of the negative counterexamples is bounded. 

Open Question: Is the BNCBc$^n$ hierarchy strict?

Open Question: Is TxtBc$^*$ $\subseteq$ BNCBc$^1$?

Abstract. In this paper, we study inferability of term rewriting systems
from positive examples alone. We define a class of simple flat term rewrit- 
ing systems that are inferable from positive examples. In flat term rewrit- 
ing systems, nesting of defined symbols is forbidden in both left- and 
right-hand sides. A flat TRS is simple if the size of redexes in the right- 
hand sides is bounded by the size of the corresponding left-hand sides. 
The class of simple flat TRSs is rich enough to include many divide-and- 
conquer programs like addition, doubling, tree-count, list-count, split, 
append, etc. The relation between our results and the known results on 
Prolog programs is also discussed. In particular, flat TRSs can define 
functions (like doubling), whose output is bigger in size than the input, 
which is not possible with linearly-moded Prolog programs. 



1 Introduction 

Starting from the influential works of Gold [6] and Blum and Blum [4], a lot of
effort has gone into developing a rich theory about inductive inference and the
classes of concepts which can be learned from both positive (examples) and
negative data (counterexamples) and the classes of concepts which can be learned
from positive data alone. The study of inferability from positive data alone is
important because negative information is hard to obtain in practice: positive
examples are much easier to generate by conducting experiments than negative
examples in general. In his seminal paper [6] on inductive inference, Gold proved
that even simple classes of concepts like the class of regular languages cannot be
inferred from positive examples alone. This strong negative result disappointed
the scientists in the field until Angluin [1] gave a characterization of the
classes of concepts that can be inferred from positive data alone and exhibited a
few nontrivial classes of concepts inferable from positive data. This influential
paper inspired further research on inductive inference from positive data.
Since then many positive results have been published about inductive inference of logic
programs and pattern languages from positive data (see a.o., [10,2,3,11,8]). To
the best of our knowledge, inductive inference of term rewriting systems from
positive data has not been studied so far.

In the last few decades, term rewriting systems have played a fundamental 
role in the analysis and implementation of abstract data type specifications, 
decidability of word problems, theorem proving, computability theory, design 
of functional programming languages (e.g. Miranda), integration of functional 
and logic programming paradigms, etc. In particular, term rewriting systems are 
very similar to functional programs and any learning results on term rewriting 
systems can be transferred to functional programs.

In this paper, we propose a class of simple flat term rewriting systems that 
are inferable from positive examples. In flat TRSs, nesting of defined symbols 
is forbidden in both left- and right-hand sides. A flat TRS is simple if the size 
of redexes in the right-hand sides is bounded by the size of the corresponding 
left-hand sides. This condition ensures that the reachability problem over flat 
terms is decidable. We prove this result by providing an algorithm to decide 
whether a flat term t reduces to a term u by a simple flat TRS R or not.

Simple flat TRSs have the nice property that the size of redexes in a derivation
starting from a flat term t is bounded by the size of the initial term (actually,
the size of redexes in t) and the size of the caps of intermediate terms in a
derivation $t \Rightarrow^* u$ starting from a flat term is bounded by the size
of the cap of the final term, where the cap of a term is defined as the largest
top portion of the term containing just constructor symbols and variables. These
properties guarantee that we only need to consider rewrite rules whose sides are
bounded by the size of the examples in learning simple flat TRSs from positive data.

The class of simple flat TRSs is rich enough to include many divide-and- 
conquer programs like addition, doubling, tree-count, list-count, split, append, 
etc. The relation between our results and the known results on Prolog programs 
is discussed in a later section. In particular, flat TRSs can define functions (like 
doubling), whose output is bigger in size than the input, which is not possible 
with linearly-moded Prolog programs. On the other hand, flat TRSs do not allow 
nesting of defined symbols in the rewrite rules, which means that we cannot 
define functions like quick-sort, that can be defined by a linearly-moded Prolog 
program. 

The rest of the paper is organized as follows. The next section gives pre- 
liminary definitions and results needed about inductive inference. In section 3, 
we define the class of simple flat TRSs and establish a few properties about 
intermediate terms in derivations of these systems over flat terms. In section 4, 
we prove decidability of $t \Rightarrow_R^* u$ and then the inferability of simple flat TRSs
from positive data is established in section 5. The final section concludes with a 
discussion on open problems. 




Inductive Inference of Term Rewriting Systems from Positive Data 






2 Preliminaries 

We assume that the reader is familiar with the basic terminology of term rewriting
and inductive inference and use the standard terminology from [7,5] and [6,10].
In the following, we recall some definitions and results needed in the sequel.

The alphabet of a first order language L is a tuple $(\Sigma, X)$ of mutually disjoint
sets such that $\Sigma$ is a finite set of function symbols and X is a set of variables.
In the following, $T(\Sigma, X)$ denotes the set of terms constructed from the function
symbols in $\Sigma$ and the variables in X. The size of a term t, denoted by $|t|$,
is defined as the number of occurrences of symbols (except the punctuation
symbols) occurring in it.

Definition 1. A term rewriting system (TRS, for short) $\mathcal{R}$ is a pair $(\Sigma, R)$
consisting of a set $\Sigma$ of function symbols and a set R of rewrite rules of the form
$l \to r$ satisfying:

(a) l and r are first order terms in $T(\Sigma, X)$,

(b) the left-hand side l is not a variable and

(c) each variable occurring in r also occurs in l.



Example 1. The following TRS defines multiplication over natural numbers. 

a(0,y) -> y
a(s(x),y) -> s(a(x,y))
m(0,y) -> 0
m(s(x),y) -> a(y,m(x,y))

Here, a stands for addition and m stands for multiplication. □ 



Definition 2. A context C[, . . . , ] is a term in $T(\Sigma \cup \{\Box\}, X)$. If C[, . . . , ] is a
context containing n occurrences of $\Box$ and $t_1, \ldots, t_n$ are terms then $C[t_1, \ldots, t_n]$
is the result of replacing the occurrences of $\Box$ from left to right by $t_1, \ldots, t_n$. A
context containing precisely 1 occurrence of $\Box$ is denoted C[ ].



Definition 3. The rewrite relation $\Rightarrow_{\mathcal{R}}$ induced by a TRS $\mathcal{R}$ is defined as
follows: $s \Rightarrow_{\mathcal{R}} t$ if there is a rewrite rule $l \to r$ in $\mathcal{R}$, a substitution $\sigma$ and a
context C[ ] such that $s = C[l\sigma]$ and $t = C[r\sigma]$.

We say that s reduces to t in one rewrite (or reduction) step if $s \Rightarrow_{\mathcal{R}} t$ and say
s reduces to t (or t is reachable from s) if $s \Rightarrow_{\mathcal{R}}^* t$, where $\Rightarrow_{\mathcal{R}}^*$ is the
transitive-reflexive closure of $\Rightarrow_{\mathcal{R}}$. The subterm $l\sigma$ in s is called a redex.

Example 2. The following derivation shows a computation of the value of
term m(s(s(0)), s(s(s(0)))) by the above TRS.

m(s(s(0)),s(s(s(0)))) =>  a(s(s(s(0))),m(s(0),s(s(s(0)))))
                      =>  a(s(s(s(0))),a(s(s(s(0))),m(0,s(s(s(0))))))
                      =>  a(s(s(s(0))),a(s(s(s(0))),0))
                      =>  s(a(s(s(0)),a(s(s(s(0))),0)))
                      =>  s(s(a(s(0),a(s(s(s(0))),0))))
                      =>  s(s(s(a(0,a(s(s(s(0))),0)))))
                      =>  s(s(s(a(s(s(s(0))),0))))
                      =>* s(s(s(s(s(s(0))))))

This is one of the many possible derivations from m(s(s(0)), s(s(s(0)))). Since 
the system is both terminating and confluent, every derivation from this term 
ends in the final value s(s(s(s(s(s(0)))))). □ 
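The computation in Example 2 can be reproduced mechanically. The following is a minimal sketch, not part of the paper: it assumes a term encoding of our own choosing (a variable is a string, an application f(t1,...,tn) is a tuple `(f, t1, ..., tn)`), and the helper names `match`, `substitute`, `rewrite_step`, `normalize` and `num` are hypothetical. It uses an outermost-leftmost strategy, so its derivation differs from the one shown in Example 2, but since the system is terminating and confluent it ends in the same normal form.

```python
# Terms: a variable is a str; an application f(t1,...,tn) is a tuple (f, t1, ..., tn).
# The four rules of Example 1 (a = addition, m = multiplication).
RULES = [
    (("a", ("0",), "y"), "y"),
    (("a", ("s", "x"), "y"), ("s", ("a", "x", "y"))),
    (("m", ("0",), "y"), ("0",)),
    (("m", ("s", "x"), "y"), ("a", "y", ("m", "x", "y"))),
]

def match(pattern, term, subst):
    """Extend subst so that pattern instantiated by subst equals term; None on failure."""
    if isinstance(pattern, str):                      # pattern is a variable
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(term, str) or pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p_arg, t_arg in zip(pattern[1:], term[1:]):
        subst = match(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst

def substitute(term, subst):
    """Apply a substitution to a term."""
    if isinstance(term, str):
        return subst.get(term, term)
    return (term[0],) + tuple(substitute(arg, subst) for arg in term[1:])

def rewrite_step(term, rules):
    """One rewrite step at the outermost-leftmost redex, or None for a normal form."""
    if isinstance(term, str):
        return None
    for lhs, rhs in rules:
        subst = match(lhs, term, {})
        if subst is not None:
            return substitute(rhs, subst)
    for i, arg in enumerate(term[1:], start=1):
        reduced = rewrite_step(arg, rules)
        if reduced is not None:
            return term[:i] + (reduced,) + term[i + 1:]
    return None

def normalize(term, rules):
    """Rewrite until no rule applies; returns (normal form, number of steps)."""
    steps = 0
    while (nxt := rewrite_step(term, rules)) is not None:
        term, steps = nxt, steps + 1
    return term, steps

def num(n):
    """Encode the natural number n as s(s(...s(0)...))."""
    t = ("0",)
    for _ in range(n):
        t = ("s", t)
    return t

result, steps = normalize(("m", num(2), num(3)), RULES)
```

Here `result` is the encoding of 6, matching the final value of Example 2; `normalize` only terminates on terminating systems such as this one.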



Remark 1. The conditions (b) the left-hand side l is not a variable and (c) each
variable occurring in r also occurs in l of Definition 1 avoid trivial nonterminating
computations. If a rewrite rule $x \to r$ with a variable left-hand side is
present in a TRS, every term can be rewritten by this rule and hence no normal
form exists, resulting in nonterminating computations. If the right-hand side r
contains a variable y not present in the left-hand side l of a rule $l \to r$, then the
term l can be rewritten to C[l] (with the substitution $\sigma$ replacing the extra variable by l),
resulting in ever growing terms and obvious nontermination.



Definition 4. Let U and E be two recursively enumerable sets, whose elements
are called objects and expressions respectively.

- A concept is a subset $\Gamma \subseteq U$.

- An example is a tuple (A, a) where $A \in U$ and a = true or false. Example
(A, a) is positive if a = true and negative otherwise.

- A concept $\Gamma$ is consistent with a sequence of examples $(A_1, a_1), \ldots, (A_m, a_m)$
when $A_i \in \Gamma$ if and only if $a_i$ = true, for each $i \in [1, m]$.

- A formal system is a finite subset $R \subseteq E$.

- A semantic mapping is a mapping $\Phi$ from formal systems to concepts.

- We say that a formal system R defines a concept $\Gamma$ if $\Phi(R) = \Gamma$.



Definition 5. A concept defining framework is a triple $(U, E, \Phi)$ of a universe
U of objects, a set E of expressions and a semantic mapping $\Phi$.



Definition 6. A class of concepts $C = \{\Gamma_1, \Gamma_2, \ldots\}$ is an indexed family of
recursive concepts if there exists an algorithm that decides whether $w \in \Gamma_i$ for
any object w and natural number i.

From here on, we fix a concept defining framework $(U, E, \Phi)$ arbitrarily and
only consider indexed families of recursive concepts.

Definition 7. A positive presentation of a nonempty concept $\Gamma \subseteq U$ is an
infinite sequence $w_1, w_2, \ldots$ of objects (positive examples) such that
$\{w_i \mid i \geq 1\} = \Gamma$.

An inference machine is an effective procedure that requests an object as an
example from time to time and produces a concept (or a formal system defining
a concept) as a conjecture from time to time. Given a positive presentation
$\sigma = w_1, w_2, \ldots$, an inference machine IM generates a sequence of conjectures
$g_1, g_2, \ldots$. We say that IM converges to g on input $\sigma$ if the sequence of conjectures
$g_1, g_2, \ldots$ is finite and ends in g or there exists a positive integer $k_0$ such that
$g_k = g$ for all $k \geq k_0$.

Definition 8. A class C of concepts is inferable from positive data if there exists
an inference machine IM such that for any $\Gamma \in C$ and any positive presentation
$\sigma$ of $\Gamma$, IM converges to a formal system g such that $\Phi(g) = \Gamma$.

We need the following result of Shinohara [10] in proving our result. 

Definition 9. A semantic mapping $\Phi$ is monotonic if $R \subseteq R'$ implies
$\Phi(R) \subseteq \Phi(R')$. A formal system R is reduced w.r.t. $S \subseteq U$ if $S \subseteq \Phi(R)$ and
$S \not\subseteq \Phi(R')$ for any proper subset $R' \subset R$.



Definition 10. A concept defining framework $C = (U, E, \Phi)$ has bounded finite
thickness if

1. $\Phi$ is monotonic and

2. for any finite set $S \subseteq U$ and any $m \geq 0$, the set $\{\Phi(R) \mid R$ is reduced w.r.t.
S and $|R| \leq m\}$ is finite.



Theorem 1. (Shinohara [10])

If a concept defining framework $C = (U, E, \Phi)$ has bounded finite thickness, then
the class

$C^m = \{\Phi(R) \mid R \subseteq E, |R| \leq m\}$

of concepts is inferable from positive data for every $m \geq 1$.



3 Simple Flat Rewrite Systems 

In the following, we partition $\Sigma$ into a set D of defined symbols that may occur as
the outermost symbol of the left-hand side of rules and a set C of constructor symbols
that do not occur as the outermost symbol of the left-hand side of rules.

Definition 11. (flat terms and systems)

- A term t in $T(\Sigma, X)$ is flat if there is no nesting of defined symbols in t.

- A TRS R is flat if left-hand sides and right-hand sides of all the rules in it
are flat.

- The cap of a term t in $T(D \cup C, X)$ is defined as
$cap(t) = \Box$ if $root(t) \in D$, and
$cap(t) = f(cap(t_1), \ldots, cap(t_n))$ if $t = f(t_1, \ldots, t_n)$ and $f \notin D$.

- The multiset¹ of maximal subterms with defined symbol roots in a term
$t \in T(D \cup C, X)$ is defined as
$Dsub(t) = \{t\}$ if $root(t) \in D$, and
$Dsub(t) = Dsub(t_1) \cup \ldots \cup Dsub(t_n)$ if $t = f(t_1, \ldots, t_n)$ and $f \in C$.

- We write $t \preceq u$ if cap(t) is a subcap of cap(u), that is, we can obtain cap(u)
from cap(t) by substituting some holes in cap(t) by contexts. In other words,
cap(t) is a top portion of cap(u).
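The notions of Definition 11 translate directly into code. The sketch below is illustrative only and rests on assumptions of ours: terms are encoded as nested tuples `(f, t1, ..., tn)` with variables as strings, the defined symbols are those of Example 1, the hole is the one-element tuple `("□",)`, and a variable is treated as its own cap with an empty Dsub (the definition itself does not spell out the variable case).

```python
from collections import Counter

DEFINED = {"a", "m"}   # assumption: the defined symbols of Example 1
HOLE = ("□",)          # the hole used in caps

def is_var(t):
    return isinstance(t, str)

def cap(t):
    """Largest top portion of t built from constructors and variables only."""
    if is_var(t):
        return t
    if t[0] in DEFINED:
        return HOLE
    return (t[0],) + tuple(cap(arg) for arg in t[1:])

def dsub(t):
    """Multiset (Counter) of maximal subterms of t whose root is a defined symbol."""
    if is_var(t):
        return Counter()
    if t[0] in DEFINED:
        return Counter([t])
    total = Counter()
    for arg in t[1:]:
        total += dsub(arg)
    return total

def has_defined(t):
    """True if some defined symbol occurs anywhere in t."""
    return not is_var(t) and (t[0] in DEFINED or any(has_defined(a) for a in t[1:]))

def is_flat(t):
    """True if no defined symbol occurs below another defined symbol."""
    if is_var(t):
        return True
    if t[0] in DEFINED:
        return not any(has_defined(a) for a in t[1:])
    return all(is_flat(a) for a in t[1:])

def is_subcap(c, d):
    """True if cap d can be obtained from cap c by filling some holes of c."""
    if c == HOLE:
        return True
    if is_var(c) or is_var(d):
        return c == d
    return (c[0] == d[0] and len(c) == len(d)
            and all(is_subcap(x, y) for x, y in zip(c[1:], d[1:])))
```

For instance, with t = s(a(0, s(0))) the cap is s(□) and Dsub(t) contains the single term a(0, s(0)). Under this reading, the right-hand side a(y, m(x, y)) of the multiplication rule in Example 1 is not flat, because m is nested inside a, whereas both addition rules are flat.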



Definition 12. A flat rewrite rule $l \to r$ is simple if $|l\sigma| \geq |s\sigma|$ for every
$s \in Dsub(r)$ and every substitution $\sigma$. A flat term rewriting system R is simple if it
contains finitely many rules and each rule in it is simple.

The above condition ensures that no variable occurs more often in a subterm 
of the right-hand side with defined symbol root than its occurrences in the left- 
hand side. This is formalized by the following lemma. 

Lemma 1. Let t and s be two terms. Then, $|t\sigma| \geq |s\sigma|$ for every substitution
$\sigma$ if and only if $|t| \geq |s|$ and no variable occurs more often in s than in t.

Proof sketch: If-part. Induction on the number of variables in Domain($\sigma$).

Only-if-part. When $\sigma$ is the empty substitution, $|t\sigma| \geq |s\sigma|$ implies $|t| \geq |s|$.
We can prove by contradiction that no variable occurs more often in s than in t
if $|t\sigma| \geq |s\sigma|$ for every substitution $\sigma$. □

It is easy to check (syntactically) whether a given TRS is flat or not. 

Theorem 2. It is decidable in linear time whether a given TRS is simple flat 
or not. 

Proof : Follows from the above lemma. □ 
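Lemma 1 turns the quantification over all substitutions in Definition 12 into a purely syntactic test. The following sketch illustrates that test under assumptions of ours: terms are nested tuples `(f, t1, ..., tn)` with variables as strings, the rule is already known to be flat, and the function names are hypothetical. It is written for clarity and is not tuned to meet the linear-time bound of Theorem 2.

```python
from collections import Counter

def is_var(t):
    return isinstance(t, str)

def size(t):
    """Number of symbol occurrences in t."""
    return 1 if is_var(t) else 1 + sum(size(a) for a in t[1:])

def var_counts(t):
    """Multiset of variable occurrences in t."""
    if is_var(t):
        return Counter([t])
    total = Counter()
    for a in t[1:]:
        total += var_counts(a)
    return total

def dsub(t, defined):
    """List of maximal subterms of t whose root is a defined symbol."""
    if is_var(t):
        return []
    if t[0] in defined:
        return [t]
    return [s for a in t[1:] for s in dsub(a, defined)]

def dominates(t, s):
    """Syntactic test of Lemma 1: |t| >= |s| and no variable occurs more often in s than in t."""
    t_vars, s_vars = var_counts(t), var_counts(s)
    return size(t) >= size(s) and all(s_vars[v] <= t_vars[v] for v in s_vars)

def is_simple_rule(lhs, rhs, defined):
    """Definition 12 for a flat rule lhs -> rhs, checked via Lemma 1."""
    return all(dominates(lhs, s) for s in dsub(rhs, defined))
```

As a usage example, `is_simple_rule(("a", ("s", "x"), "y"), ("s", ("a", "x", "y")), {"a"})` holds, while a hypothetical rule f(s(x), y) -> g(x, x) with g defined fails the test: the sizes are fine (4 against 3), but x occurs twice in g(x, x) and only once on the left, so a large instance of x violates the size condition.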



Remark 2. Simple flat TRSs have some similarities to the length-bounded elementary
formal systems [10]. Length-bounded elementary formal systems (LB-EFS)
are logic programs such that the sum of the arguments of body atoms
is bounded by the sum of the arguments of the head. However, there are some
significant differences between them.

(a) Elementary formal systems only have constants and function symbols of arity
one, whereas simple flat TRSs have function symbols of arbitrary arity.

(b) Simple flat TRSs with an additional condition $|l\sigma| > |s\sigma|$ for every
$s \in Dsub(r)$ and every substitution $\sigma$ are terminating, whereas LB-EFS with
a similar condition can be nonterminating. For example, the elementary formal
system with clauses

p(f(X)) ← p(X)
p(a) ←

is length-bounded, but has an infinite SLD-derivation starting from an initial
query ← p(Y).

It may be noted that the above LB-EFS terminates for all ground queries
(though it loops for nonground queries). In this sense, LB-EFSs are good for
checking (whether an atom is in the least Herbrand model) but not good for
computing, as computing involves variables in the queries and LB-EFS can
be nonterminating for nonground queries even with the additional condition
that body atoms are strictly smaller than the head. In contrast, simple flat
TRSs with such an additional condition are good for both checking and computing
as they terminate for every term (ground as well as nonground).

¹ Multisets allow repetitions of elements and the multiset {1,1,2} is different from the
multiset {1,2}. Here, $\cup$ stands for the multiset union.

A nice property of simple flat TRSs is that the sum of the sizes of arguments 
of defined symbols in any derivation is bounded by the maximum sum of the 
sizes of arguments of the defined symbols in the initial term. This in turn ensures 
that it is decidable whether a term t reduces to a term u by a simple flat system 
or not. These facts are established in the following results. 

Lemma 2. Let R be a simple flat TRS and t be a flat term such that n is
greater than the size of every term in Dsub(t). If $t \Rightarrow_R u$ then n is greater than
the size of every term in Dsub(u). Further, u is also a flat term and $t \preceq u$.

Proof : Let $l \to r$ be the rule and $\sigma$ the substitution applied in $t \Rightarrow_R u$ (say
at position p). That is, $t = C[l\sigma]$ and $u = C[r\sigma]$ for some context C. It is clear
that $Dsub(u) = (Dsub(t) - \{l\sigma\}) \cup Dsub(r\sigma)$. Since R is a simple flat TRS,
$|l\sigma| \geq |s\sigma|$ for every $s \in Dsub(r)$ and hence n is greater than the size of every
term in Dsub(u).

Since $t = C[l\sigma]$ is flat, no defined symbol occurs in $\sigma$ and hence $r\sigma$ is flat
because r is flat. Therefore, $u = C[r\sigma]$ is flat.

It is easy to see that $cap(u) = cap(t)[cap(r\sigma)]_p$ and hence $t \preceq u$. □

The following theorem states the above fact for any term reachable from t. 

Theorem 3. Let R be a simple flat TRS and t be a flat term such that n is
greater than the size of every term in Dsub(t). If $t \Rightarrow_R^* u$ then n is greater than
the size of every term in Dsub(u). Further, u is also a flat term and $w \preceq u$ for
every term w in $t \Rightarrow_R^* w \Rightarrow_R^* u$.

Proof : Induction on the length of the derivation $t \Rightarrow_R^* u$ and Lemma 2. □

The above theorem essentially says that the size of Dsub-terms of intermediate
terms is bounded by the size of Dsub-terms of the initial term and the size
of caps of intermediate terms is bounded by the size of the cap of the final term.



4 Decidability of $t \Rightarrow_R^* u$

In this section, we prove that it is decidable whether a flat term t reduces to a 
term u by a simple flat system R or not, by providing a decision procedure for 
this problem. Throughout this section, we only consider flat terms and simple 
flat systems. 

In our decision procedure, we use a left-to-right reduction strategy and keep
track of reductions which did not increase the cap (for loop checking). The
intermediate terms are annotated with this information, i.e., we consider tuples
of the form (v, V), where v is a term and V is a set of redexes. The following
definition gives the next possible terms to consider.

Definition 13. A tuple (w, W) is a possible successor to a tuple (v, V) in reaching
a term u if

1. $v \Rightarrow_R w$ by an application of rule $l \to r$ at position p with substitution $\sigma$,

2. p is the leftmost redex position in v such that $v/p \neq u/p$, i.e., the subterms
at position p in u and v differ,

3. $w \preceq u$,

4. $r\sigma \neq l\sigma$ and $r\sigma \notin V$ (loop checking),

5. $W = \emptyset$ and root(r) is a constructor symbol OR
$W = \emptyset$ and $r\sigma = u/p$ OR
$W = V \cup \{l\sigma\}$, $r\sigma \neq u/p$ and root(r) is a defined symbol.



The set of all successors of (v, V) is denoted by NEXT((v, V), u).

Condition 2 ensures that the reduction is applied at the leftmost position that
needs to be reduced. Condition 3 ensures that the reduction contributes to the
cap of u and does not add any constructor symbols not in the cap of u. Condition
4 ensures that the same redex does not appear repeatedly at the same position (loop
checking). Condition 5 is about book-keeping. If root(r) is a constructor symbol
or $r\sigma = u/p$, it is clear that no more reductions take place at p in w (but
reductions may take place below or to the right of p). Therefore, there is no
need for any more loop checking at position p and hence W is set to $\emptyset$. In the
other case, $l\sigma$ is added to W.

Example 3. Consider the following simple flat system.

a(0,y) -> y
a(s(x),y) -> s(a(x,y))
b -> f(b)
b -> f(f(b))
b -> g(b,b)
b -> c
c -> b
c -> c
c -> d

$NEXT((f(b), \emptyset), f(f(f(b)))) = \{(f(f(b)), \emptyset), (f(f(f(b))), \emptyset), (f(c), \{b\})\}$. Note
that the 5th rule cannot be applied in computing NEXT as $f(g(b,b)) \not\preceq f(f(f(b)))$,
i.e., due to condition 3. When the 6th rule is applied, the redex b is saved for loop
checking as the right-hand side has a defined symbol as root.

$NEXT((f(f(b)), \emptyset), f(f(f(b)))) = \{(f(f(f(b))), \emptyset), (f(f(c)), \{b\})\}$.
Note that the 4th rule cannot be applied due to condition 3.

$NEXT((f(c), \{b\}), f(f(f(b)))) = \emptyset$ because the 7th rule and the 8th rule cannot be
applied due to condition 4 (loop checking) and the 9th rule cannot be applied due
to condition 3.

$NEXT((a(s(0),s(0)), \emptyset), s(s(0))) = \{(s(a(0,s(0))), \emptyset)\}$.

$NEXT((s(a(0,s(0))), \emptyset), s(s(0))) = \{(s(s(0)), \emptyset)\}$.

$NEXT((a(s(0),s(0)), \emptyset), s(s(s(0)))) = \{(s(a(0,s(0))), \emptyset)\}$.

$NEXT((s(a(0,s(0))), \emptyset), s(s(s(0)))) = \{(s(s(0)), \emptyset)\}$.

$NEXT((s(s(0)), \emptyset), s(s(s(0)))) = \emptyset$. □



Lemma 3. $v \Rightarrow_R w$ for every (w, W) in NEXT((v, V), u).

Proof : Straightforward from Definition 13. □ 

The following function checks whether a flat term t reduces to a term u by
a simple flat system or not. Here, (u, -) stands for any tuple with u as the first
element and $(u, -) \in S$ is true if there is a tuple in S with u as the first element.

function REACHABILITY-CHECK(R, t, u);
begin
    if t ⋠ u then Return(no);
    Queue := {(t, ∅)};
    while Queue ≠ ∅ do
    begin
        Let (v, V) be the first element of Queue;
        if u = v then Return(yes);
        Queue := (Queue − {(v, V)}) ∪ NEXT((v, V), u)
    end;
    Return(no);
end;



The above function explores the search space of terms derivable from
t in a breadth-first fashion (using the data structure Queue), looking for u. The search
space is constructed conservatively by creating only potential intermediate terms
leading to u and using loop checking.
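The breadth-first search can be sketched in runnable form. The code below is a simplified variant and not the procedure above: it drops the leftmost-position restriction and the redex sets V of Definition 13, and instead keeps a set of visited terms and prunes every term w whose cap is not a subcap of cap(u), which Theorem 3 justifies. The term encoding (nested tuples, variables as strings) and all function names are assumptions made for illustration. The search is only guaranteed to terminate when the system is simple flat and t is flat.

```python
from collections import deque

HOLE = ("□",)  # the hole used in caps

def is_var(t):
    return isinstance(t, str)

def match(pattern, term, subst):
    """Extend subst so that pattern instantiated by subst equals term; None on failure."""
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if is_var(term) or pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p_arg, t_arg in zip(pattern[1:], term[1:]):
        subst = match(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst

def substitute(term, subst):
    if is_var(term):
        return subst.get(term, term)
    return (term[0],) + tuple(substitute(a, subst) for a in term[1:])

def successors(term, rules):
    """Yield every term obtained from term by one rewrite step at any position."""
    if is_var(term):
        return
    for lhs, rhs in rules:
        subst = match(lhs, term, {})
        if subst is not None:
            yield substitute(rhs, subst)
    for i in range(1, len(term)):
        for reduced in successors(term[i], rules):
            yield term[:i] + (reduced,) + term[i + 1:]

def cap(term, defined):
    if is_var(term):
        return term
    if term[0] in defined:
        return HOLE
    return (term[0],) + tuple(cap(a, defined) for a in term[1:])

def is_subcap(c, d):
    """True if cap d can be obtained from cap c by filling some holes of c."""
    if c == HOLE:
        return True
    if is_var(c) or is_var(d):
        return c == d
    return (c[0] == d[0] and len(c) == len(d)
            and all(is_subcap(x, y) for x, y in zip(c[1:], d[1:])))

def reachable(rules, defined, t, u):
    """Breadth-first search for u from t, pruned by the subcap condition."""
    cap_u = cap(u, defined)
    if not is_subcap(cap(t, defined), cap_u):
        return False
    queue, seen = deque([t]), {t}
    while queue:
        v = queue.popleft()
        if v == u:
            return True
        for w in successors(v, rules):
            if w not in seen and is_subcap(cap(w, defined), cap_u):
                seen.add(w)
                queue.append(w)
    return False

# The simple flat system of Example 3.
B, C, ZERO = ("b",), ("c",), ("0",)
def f(x): return ("f", x)
def s(x): return ("s", x)
RULES = [
    (("a", ZERO, "y"), "y"),
    (("a", ("s", "x"), "y"), ("s", ("a", "x", "y"))),
    (B, f(B)), (B, f(f(B))), (B, ("g", B, B)), (B, C),
    (C, B), (C, C), (C, ("d",)),
]
DEFINED = {"a", "b", "c"}
```

With these definitions, `reachable(RULES, DEFINED, f(B), f(f(f(B))))` succeeds, while the query from a(s(0), s(0)) to s(s(s(0))) fails once the queue is exhausted. The visited set plays the role of loop checking here, at the cost of exploring more interleavings than the left-to-right strategy would.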

We prove the correctness of this decision procedure in the rest of this section.

Lemma 4. $t \Rightarrow_R^* w$ for every (w, W) added to Queue during the execution of
REACHABILITY-CHECK(R, t, u).

Proof : Follows from Lemma 3 and noetherian induction. □ 

The following theorem proves termination of REACHABILITY-CHECK.

Theorem 4. If R is a simple flat TRS and t and u are two flat terms, the
function call REACHABILITY-CHECK(R, t, u) terminates.

Proof : Consider a tuple (w, W) added to Queue in the function call
REACHABILITY-CHECK(R, t, u). By Lemma 4, $t \Rightarrow_R^* w$. Let n be the smallest integer greater
than the size of every term in Dsub(t). By Theorem 3, n is greater than the size
of every term in Dsub(w). Further, $w \preceq u$ by condition 3 of Definition 13, and
hence the size of cap(w) is bounded by the size of cap(u). Therefore the set
of distinct tuples that are added to Queue is finite as both $\Sigma$ and R are finite.

Since we are following a left-to-right reduction strategy and using loop checking,
no tuple is added to Queue more than once. In every iteration of the while
loop, one tuple is deleted from Queue and hence REACHABILITY-CHECK(R, t, u)
terminates. □

We need the following results about left-to-right derivations. 

Definition 14. A derivation $t_1 \Rightarrow t_2 \Rightarrow \cdots \Rightarrow t_n$ is a left-to-right derivation if
there is no $j \in [2, n-1]$ such that reduction $t_{j-1} \Rightarrow t_j$ takes place at position
$p_1$ and reduction $t_j \Rightarrow t_{j+1}$ takes place at position $p_2$ such that $p_2$ is to the left
of $p_1$.



Lemma 5. If $t_1$, $t_2$ and $t_3$ are flat terms such that $t_1 \Rightarrow_{p_1} t_2 \Rightarrow_{p_2} t_3$ and $p_2$ is
to the left of $p_1$, then there is a term $t'_2$ such that $t_1 \Rightarrow_{p_2} t'_2 \Rightarrow_{p_1} t_3$.

Proof : Let C be the context obtained from $t_1$ by replacing the subterms at
positions $p_1$ and $p_2$ by holes, and $u_1 = t_1/p_1$, $u_2 = t_1/p_2$, $v_1 = t_2/p_1$, $v_2 = t_3/p_2$.
It is easy to see that $t_1 = C[u_1, u_2] \Rightarrow_{p_2} t'_2 = C[u_1, v_2] \Rightarrow_{p_1} t_3 = C[v_1, v_2]$. □

Lemma 6. If R is a simple flat TRS and t and u are flat terms such that $t \Rightarrow_R^* u$,
then there is a left-to-right derivation from t to u.

Proof : We can obtain a left-to-right derivation from t to u by repeatedly applying
Lemma 5. □



Lemma 7. If R is a simple flat TRS and t, w and u are flat terms such that
$t \Rightarrow_R^* w \Rightarrow_R^* u$ is the shortest left-to-right derivation from t to u, then (w, -) is
added to Queue during the execution of REACHABILITY-CHECK(R, t, u).

Proof : Induction on the length l of the derivation $t \Rightarrow_R^* w$.

Basis : l = 0. In this case, w = t and $(t, \emptyset)$ is added to Queue in the initialization
statement.

Induction step : Consider the previous term w' in $t \Rightarrow_R^* w' \Rightarrow_R w \Rightarrow_R^* u$.
By the induction hypothesis, (w', -) is added to Queue. Since
REACHABILITY-CHECK(R, t, u) terminates, (w', -) is replaced by NEXT((w', -), u). As $w \preceq u$
(by Theorem 3) and $t \Rightarrow_R^* u$ is the shortest derivation (i.e., no looping), it is clear
that $(w, -) \in NEXT((w', -), u)$ and hence (w, -) is added to Queue. □

The following theorem establishes the decidability of $t \Rightarrow_R^* u$.

Theorem 5. If t and u are two flat terms and R is a simple flat TRS, it is
decidable whether $t \Rightarrow_R^* u$ or not.

Proof : Since REACHABILITY-CHECK is terminating, it is enough to show that
REACHABILITY-CHECK(R, t, u) returns 'yes' if and only if $t \Rightarrow_R^* u$.

If-case : If $t \Rightarrow_R^* u$, it follows by Lemma 7 that (u, -) is added to Queue. It
will eventually become the first element and REACHABILITY-CHECK(R, t, u) will
return 'yes'.

Only-if-case : REACHABILITY-CHECK(R, t, u) returns 'yes' only if (u, -) is a
member of Queue. If (u, -) is a member of Queue, $t \Rightarrow_R^* u$ by Lemma 4. □



5 Inferability of Simple Flat TRSs from Positive Data 

In this section, we establish inductive inferability of simple flat TRSs from pos- 
itive data. 







M.R.K.K. Rao 



Definition 15. Let SF be the set of all simple flat rules, FT be the cartesian
product of the set of all flat terms (with the same set) and $\Phi$ be a semantic
mapping such that $\Phi(R)$ is the relation $\{(s,t) \mid s \Rightarrow_R^* t\}$ over flat terms. The
concept defining framework $(FT, SF, \Phi)$ is denoted by SFT.



Lemma 8. The class of rewrite relations defined by simple flat TRSs is an 
indexed family of recursive concepts. 

Proof : By Theorem 5, it is decidable whether $t \Rightarrow_R^* u$ for any simple flat TRS R
and flat terms t and u. By Theorem 2, it is decidable whether a TRS is simple
flat or not. Since $\Sigma$ is finite, we can effectively enumerate all simple flat TRSs,
listing the TRSs of size n before those of size n + 1, where the size of a TRS is
defined as the sum of the sizes of all the left- and right-hand side terms in it. □



The following theorem plays the predominant role in proving our main result. 



Theorem 6. The concept defining framework $SFT = (FT, SF, \Phi)$ has bounded
finite thickness.

Proof : Since $\Phi$ is the rewrite relation, it is obviously monotonic, i.e.,
$\Phi(R_1) \subseteq \Phi(R_2)$ whenever $R_1 \subseteq R_2$.

Consider a finite relation $S \subseteq FT$ and a TRS $R \subseteq SF$ containing at most
$m \geq 1$ rules such that R is reduced w.r.t. S. Let $n_1$ be an integer such that
$n_1 > |s|$ for every term s in $\{s \in Dsub(t) \mid (t, u) \in S\}$ and $n_2$ be an integer such
that $n_2 > |cap(u)|$ for every $(t, u) \in S$.

By Theorem 3, $n_1 > |w'|$ and $n_2 > |cap(w)|$ for every term $w' \in Dsub(w)$
such that $t \Rightarrow_R^* w \Rightarrow_R^* u$ and $(t, u) \in S$. That is, every redex in $t \Rightarrow_R^* u$ is of size
$< n_1$ and the cap of every term in $t \Rightarrow_R^* u$ is of size $< n_2$. Since R is reduced w.r.t.
S, every rule in R is used in derivations of S. Hence, $n_1 > |l|$ and $n_2 > |cap(r)|$
for every rule $l \to r \in R$.

Since $\Sigma$ is finite, there are only finitely many simple flat TRSs containing at
most m rules of the form $l \to r$ such that $n_1 > |l|$ and $n_2 > |cap(r)|$ (except
for the renaming of variables). Therefore, the set $\{\Phi(R) \mid R$ is reduced w.r.t. S
and contains at most m rules$\}$ is finite. Hence, the concept defining framework
$(FT, SF, \Phi)$ has bounded finite thickness. □

From this Theorem, Lemma 8 and Theorem 1, we obtain our main result. 

Theorem 7. For every $m \geq 1$, the class of simple flat TRSs with at most m
rules is inferable from positive data.



The above result essentially means that there is an inductive inference machine
that can effectively produce an equivalent simple flat TRS R' given any
enumeration of the set of pairs of flat terms (s,t) such that s reduces to t by the
target simple flat TRS R.



6 Discussion 

In this paper, we study inductive inference of term rewriting systems from pos- 
itive data. A class of simple flat TRSs inferable from positive data is defined. 
This class of TRSs is rich enough to include divide-and-conquer programs like 
addition, doubling, tree-count, list-count, split, append, etc. To the best of our 
knowledge, ours is the first positive result about inductive inference of term 
rewriting systems from positive data. The closest publication we know on this
problem is Mukouchi et al. [9], which deals with pure grammars like PD0L systems.
These systems are TRSs with function symbols of arity at most one, and
a restriction that the left-hand side of each rule is a constant (function of arity 
zero). This stringent restriction makes it very difficult to express even simple 
functions like addition in this class. 

To motivate an open problem for further study, we compare simple flat TRSs
with the linearly-moded Prolog programs of [8]. The class of linearly-moded
programs is the largest class of Prolog programs known to be inferable from positive
data. The classes of simple flat TRSs and linearly-moded Prolog programs are
incomparable for the following reasons.

1. Linearly-moded Prolog programs only capture functions whose output is 
bounded by the size of the inputs. Functions like addition, list-count, split, 
append, quick-sort have such a property. But functions like multiplication 
and doubling are beyond linearly-moded programs as the size of their output 
is bigger than that of the input. The following program computes the double
of a list (the output contains each element of the input twice).

Moding: double(in, out)
double([ ], [ ]) ←
double([H|T], [H,H|R]) ← double(T,R)

This program is not linearly-moded w.r.t. the above moding² as the size of
the output is twice the size of the input. In contrast, the following TRS for
computing the double of a given list is simple flat.

double([ ]) -> [ ]
double([H|T]) -> [H, H|double(T)]



This shows that there are functions that can be computed by simple flat 
TRSs but not by linearly-moded Prolog programs. 

² However, it is linearly-moded w.r.t. the moding double(in, in), in which case this
program can be used for checking whether a list L2 is the double of another list L1,
but not for computing the double of a given list. Also see Remark 2, point (b).

2. The standard Prolog program for quick-sort is linearly-moded, but it is beyond
simple flat TRSs as it involves nesting of defined symbols (partition
inside the quick-sort and quick-sort inside append). This shows that there 
are functions that can be computed by linearly-moded Prolog programs but 
not by simple flat TRSs. 
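Returning to the list-doubling TRS of item 1, its simplicity can be checked by hand with Lemma 1. The following is a sketch in which the list [H|T] is written as cons(H, T); this notation is ours and is used only to count symbols.

```latex
\begin{align*}
l &= \mathit{double}(\mathit{cons}(H, T)), & |l| &= 4,\\
r &= \mathit{cons}(H, \mathit{cons}(H, \mathit{double}(T))), & \mathit{Dsub}(r) &= \{\mathit{double}(T)\},\\
s &= \mathit{double}(T), & |s| &= 2 \le |l|.
\end{align*}
```

The only variable of s is T, which occurs once in s and once in l, so by Lemma 1 $|l\sigma| \geq |s\sigma|$ for every substitution $\sigma$ and the rule is simple. The variable H occurs twice in r, but only in the constructor part outside Dsub(r), which is why the output may grow although the redexes do not. The first rule has no defined symbol on its right-hand side, so the condition holds vacuously.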

In view of this incomparability of the simple flat TRSs and linearly-moded Pro- 
log programs, it will be very useful to work towards extending the frontiers of 
inferable classes of rewrite systems and characterize some classes of TRSs hav- 
ing the expressive power of both simple flat TRSs and linearly-moded Prolog 
programs, and yet inferable from positive data.

Abstract. In the context of learning paradigms of identification in the
limit, we address the question: why is uncertainty sometimes desirable?
We use mind change bounds on the output hypotheses as a measure of
uncertainty, and interpret 'desirable' as reduction in data memorization,
also defined in terms of mind change bounds. The resulting model
is closely related to iterative learning with bounded mind change
complexity, but the dual use of mind change bounds (for hypotheses
and for data) is a key distinctive feature of our approach. We show
that situations exist where the more mind changes the learner is willing
to accept, the less data it needs to remember in order
to converge to the correct hypothesis. We also investigate relationships
between our model and learning from good examples, set-driven, 
monotonic and strong-monotonic learners, as well as class-comprising 
versus class-preserving learnability. 

Keywords: Mind changes, long term memory, iterative learning, frugal 
learning. 



1 Introduction 

Human beings excel at making decisions based on incomplete information. Even
in situations where access to larger sets of data would result in a definitely
correct hypothesis, they might not wait to receive the complete information which
would allow them to conclude with certainty that their decision is definitely
the best. They often propose a hypothesis having received little or no evidence,
knowing that this hypothesis might not be correct and could be revised; they
then gather a few more data and propose another and hopefully better hypothesis.
Making imperfect decisions, but making them now and in the near future, is
better than having to wait an unknown and possibly very long time to make the
perfect decision. From a logical point of view, human beings draw uncertain and
nonmonotonic inferences on the basis of few assumptions when they could make
deductions on the basis of much larger sets of premises. From the point of view
of inductive inference, they are willing to change their hypotheses (execute a mind
change) when they could learn the underlying concept with certainty (without
any mind change) if they waited to receive a larger set of data.

* National ICT Australia is funded by the Australian Government's Department of
Communications, Information Technology and the Arts and the Australian Research
Council through Backing Australia's Ability and the ICT Centre of Excellence
Program.

Let us illustrate these considerations with an example. Take a class $\mathcal{L}$ of
languages (nonempty r.e. subsets of the set $\mathbb{N}$ of natural numbers) and a text $t$
for a member $L$ of $\mathcal{L}$ (so every positive example, that is, every member of $L$,
occurs in the enumeration $t$, while no negative example, that is, no nonmember
of $L$, occurs in $t$). A learner $M$ (an algorithmic device that maps finite sequences
of data to hypotheses, represented as r.e. indexes) that successfully
Ex-identifies $L$ in the limit from $t$ (outputs as hypothesis an r.e. index for $L$ in
response to all but finitely many initial segments of $t$) will often find that
many elements in $t$ are useless at the time they are presented to it. For
example, assume that $\mathcal{L}$ is the set of all nonempty final segments of $\mathbb{N}$. If 7 is
the first natural number that occurs in $t$ then $M$ will likely conjecture that
$L = \{7, 8, 9, \dots\}$. If 11 is the next natural number that occurs in $t$ then 11
will be seen as useless and can be forgotten. Only a natural number smaller
than 7 will, in case it occurs in $t$, be used by $M$. We follow the line of
research based on limiting the long term memory [3,4,6,13,14]. We propose a
model where the learner can decide to either remember or forget any incoming
datum, based on previously remembered data. The learner can output conjectures
using only the data it has recorded. It could for instance remember at most
three data, in which case it could output at most four hypotheses: one before
the first datum is presented, and one after the presentation of each of the
three data. More generality is achieved by assuming that the learner has an
ordinal data counter that can store values smaller than a given ordinal $\alpha$.
When it sees a new datum, the learner can either memorize it and decrease the
value of its data counter, or decide not to use that datum and leave the value
of the data counter unchanged. The basic idea is that learning incurs a cost
whenever the learner considers that a piece of information is relevant to the
situation and should be memorized; we propose to capture this cost by
down-counting an ordinal counter. This forces learners to make a restrictive
or frugal use of data.
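
The example of final segments can be made concrete with a short sketch. The code below is only an illustration under our own assumptions: the function name, the use of `None` for the pause symbol, and the representation of a conjecture $\{m, m+1, \dots\}$ by its least element $m$ are all hypothetical choices, not notation from the paper. The learner stores a datum only when it is smaller than every datum stored so far, and each stored datum triggers a new conjecture.

```python
from typing import Iterable, List, Optional, Tuple


def frugal_final_segment_learner(
    text: Iterable[Optional[int]],
) -> Tuple[List[int], List[int]]:
    """Process a finite prefix of a text for a final segment {m, m+1, ...}.

    None stands for the pause symbol. Returns the remembered data and the
    successive conjectures, each conjecture {m, m+1, ...} being coded by m.
    """
    memory: List[int] = []
    conjectures: List[int] = []
    for datum in text:
        if datum is None:
            continue  # a pause carries no information
        # Remember the datum only if it is a new minimum; forget it otherwise.
        if not memory or datum < memory[-1]:
            memory.append(datum)
            conjectures.append(datum)
    return memory, conjectures


memory, conjectures = frugal_final_segment_learner([7, 11, None, 9, 5, 8, 5, 6])
print(memory)       # data actually stored
print(conjectures)  # least elements of the conjectured final segments
```

On this prefix only 7 and 5 are stored; 11, 9, 8, 6 and the repeated 5 are discarded on arrival, which is the behaviour described in the example above.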

In this paper we start investigating why a learner would agree to give up
certainty for uncertainty, or a low degree of uncertainty for a higher degree of
uncertainty. As a measure of uncertainty, we take mind change bounds on the
output hypotheses (in a model where outputting a new hypothesis requires
decreasing the value of an ordinal counter), rather than probabilistic accuracy
on the output hypotheses. The main reason for thinking of uncertainty in terms
of mind change bounds is that we measure the benefit of additional uncertainty,
namely frugal use of data, in terms of ordinals, and it is natural to use a
common ‘unit of measure.’ Hence we can think of learners as being equipped with
two kinds of ordinal counters: a hypothesis counter that measures uncertainty,
and a data counter that measures the cost of keeping track of information.

Section 2 describes the framework, which is basically a model of iterative
learning with at most $\alpha$ mind changes, but where $\alpha$ is interpreted as an upper
bound on the number of data that can be remembered and used rather than an
upper bound on the number of mind changes as to what could be the target
concept. Section 3 contains the conceptually most interesting result in the
paper, which determines the exact tradeoff between frugal data consumption and
degree of uncertainty for a particular class of languages to be learnt.
Intuitively, the more mind changes the learner is willing to accept, the
smaller the amount of data it needs to remember in order to converge to the
correct hypothesis. This section also exhibits a class of languages that can be
learnt by learners making frugal use of data if data come from fat texts (where
positive examples occur infinitely often), but not if data come from arbitrary
texts. Section 4 investigates some of the relationships between our framework
and learning paradigms with more restrictive learners. In particular, it is
shown that frugal use of data is a weaker concept than learnability by
consistent and strong-monotonic learners, but not weaker than learnability by
consistent and monotonic learners. Section 5 investigates some of the
relationships between our framework and learning paradigms with more
restrictive learning criteria. In particular, it is shown that class-comprising
learnability from good examples, but not class-preserving learnability from
good examples, is a weaker concept than frugal use of data. We conclude in
Section 6.

2 Model Description 

The set of rational numbers is denoted $\mathbb{Q}$. We follow the basic model of
Inductive inference as introduced by Gold [8] while applying some notation from
our previous work [17,18]. $\mathcal{D}$ denotes a recursive set of data. This set can be
assumed to be equal to $\mathbb{N}$ and code, depending on the context, natural numbers,
finite sequences of natural numbers, finite sequences of ordinals, etc. We
usually do not make the coding explicit. We denote by $\#$ an extra symbol that
represents an empty piece of information. Given a set $X$, the set of finite
sequences of members of $X$ is denoted $X^*$; the concatenation of two members
$\sigma$, $\tau$ of $X^*$ is represented by $\sigma \star \tau$, just written $\sigma \star x$ in case $\tau$ is of the
form $(x)$. Given a member $\sigma$ of $(\mathcal{D} \cup \{\#\})^*$, we denote by $\mathrm{cnt}(\sigma)$ the set of
members of $\mathcal{D}$ that occur in $\sigma$. Given a sequence $t$ whose length is at least
equal to an integer $n$, and given $i \le n$, $t[i]$ represents the initial segment of
$t$ of length $i$ and if $i < n$, $t(i)$ represents its $(i+1)$st element. The
cardinality of a finite set $D$ is denoted $|D|$.

We fix an acceptable enumeration $(\varphi_e)_{e \in \mathbb{N}}$ of the unary partial recursive
functions over $\mathcal{D}$. For all $e \in \mathbb{N}$, $W_e$ denotes the domain of $\varphi_e$, and $D_e$
denotes the finite set whose canonical index is $e$ (i.e., $D_e = \emptyset$ if $e = 0$ and
$D_e = \{x_0, \dots, x_n\}$ if $e = 2^{x_0} + \dots + 2^{x_n}$ for some $n, x_0, \dots, x_n \in \mathbb{N}$ with
$x_0 < \dots < x_n$). The symbol $\mathcal{L}$ refers to a class of r.e. subsets of $\mathcal{D}$,
representing the class of languages to be learnt; hence $\mathcal{L}$ is a class of sets
of the form $W_e$ for $e$ ranging over a particular subset of $\mathbb{N}$. A member $\sigma$ of
$(\mathcal{D} \cup \{\#\})^*$ is said to be consistent in $\mathcal{L}$ iff there exists $L \in \mathcal{L}$ with
$\mathrm{cnt}(\sigma) \subseteq L$. A text for a member $L$ of $\mathcal{L}$ is an enumeration $t$ of $L \cup \{\#\}$
where each member of $L$ occurs at least once; $t$ is said to be fat if every
member of $L$ has infinitely many occurrences in $t$. A learner is a partial
computable function from $(\mathcal{D} \cup \{\#\})^*$ into $\mathbb{N}$. Learnability means
Ex-identification (in the limit): for all $L \in \mathcal{L}$ and for all texts $t$ for $L$,
the learner has to output on cofinitely many strict initial segments of $t$ a
fixed $e \in \mathbb{N}$ such that $W_e = L$.
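
The canonical indexing of finite sets and the content operator lend themselves to a direct illustration. The following sketch assumes that data are natural numbers and that the pause symbol is represented by `None`; the function names are ours and purely illustrative.

```python
from typing import FrozenSet, Iterable, Optional, Sequence


def canonical_set(e: int) -> FrozenSet[int]:
    """Return D_e: the positions of the 1 bits in the binary expansion of e."""
    return frozenset(x for x in range(e.bit_length()) if (e >> x) & 1)


def canonical_index(finite_set: Iterable[int]) -> int:
    """Return the e with D_e equal to the given finite set of naturals."""
    return sum(2 ** x for x in set(finite_set))


def cnt(sigma: Sequence[Optional[int]]) -> FrozenSet[int]:
    """Content of a finite sequence: the data occurring in it, pauses dropped."""
    return frozenset(x for x in sigma if x is not None)


print(sorted(canonical_set(13)))      # 13 = 2^0 + 2^2 + 2^3
print(canonical_index({0, 2, 3}))
print(sorted(cnt([4, None, 1, 4])))
```

Since every natural number has a unique binary expansion, `canonical_set` and `canonical_index` are inverse to each other, which is what makes $e$ usable as a name for $D_e$.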

We first define the key notion of memory function. 

Definition 1. A memory function is defined as a partial recursive function $h$
from $(\mathcal{D} \cup \{\#\})^*$ into $(\mathcal{D} \cup \{\#\})^*$ such that $h(()) = ()$ and for all
$\sigma \in (\mathcal{D} \cup \{\#\})^*$:

— for all $\tau \in (\mathcal{D} \cup \{\#\})^*$ with $\sigma \subseteq \tau$, if $h(\sigma)$ is undefined then $h(\tau)$ is undefined;

— for all $x \in \mathcal{D} \cup \{\#\}$, $h(\sigma \star x)$ is undefined, or equal to $h(\sigma)$, or equal to $h(\sigma) \star x$.

Let $Z$ be the range of $h$ except $()$ and given $\sigma, \tau \in Z$, let $R(\sigma, \tau)$ hold iff $\tau \subset \sigma$.
The ordinal complexity of $h$ is undefined if $(Z, R)$ is not well founded and is
equal to the length of $(Z, R)$ otherwise.¹
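
For a memory function with a finite range, the ordinal complexity is a natural number and can be computed by brute force. The sketch below is an illustration under our own assumptions: it takes the memory function that stores the first `bound` distinct data over a small finite data set, builds its range without the empty sequence, and evaluates the length of $(Z, R)$ as the least number not attained by the rank function $\rho_R(x) = \sup\{\rho_R(y) + 1 : R(y, x)\}$. All names are hypothetical.

```python
from functools import lru_cache
from itertools import permutations
from typing import Set, Tuple

Seq = Tuple[int, ...]


def memory_range(domain: range, bound: int) -> Set[Seq]:
    """Nonempty values of a memory function storing the first `bound` distinct data."""
    return {p for r in range(1, bound + 1) for p in permutations(domain, r)}


def length(z: Set[Seq]) -> int:
    """Length of (Z, R), where R(s, t) holds iff t is a strict initial segment of s."""

    @lru_cache(maxsize=None)
    def rank(x: Seq) -> int:
        # rho(x) = sup { rho(y) + 1 : y strictly extends x }; sup of nothing is 0.
        return max(
            (rank(y) + 1 for y in z if len(y) > len(x) and y[: len(x)] == x),
            default=0,
        )

    ranks = {rank(x) for x in z}
    n = 0
    while n in ranks:  # least natural number that is not a rank
        n += 1
    return n


print(length(memory_range(range(4), 3)))
```

With four possible data and room for three of them, the ranks of sequences of length 3, 2 and 1 are 0, 1 and 2, so the computed complexity is 3, in line with the informal reading of the bound as "at most three remembered data".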



Definition 2. Let a learner $M$ be given.

Given a memory function $h$, we say that $M$ has $h$-memory iff $M$ can be
represented as $g \circ h$ for some partial recursive function $g$.

The data consumption complexity of $M$ is defined and equal to ordinal $\alpha$ iff
$\alpha$ is the least ordinal such that $M$ has $h$-memory, for some memory function $h$
whose ordinal complexity is equal to $\alpha$.

When it faces an incoming datum $d$, a learner that has $h$-memory for some
memory function $h$ will either decide not to memorize $d$, or memorize $d$
forever: the learner cannot memorize $d$ for some time and then forget it.

Definition 3. Let a learner $M$ be given. Let $Z$ be the set of all $\sigma \in (\mathcal{D} \cup \{\#\})^*$
such that $M(\sigma)$ is defined. Given $\sigma, \tau \in Z$ let $R(\sigma, \tau)$ hold iff there is $\eta \in Z$ with
$\tau \subseteq \eta \subset \sigma$ and $M(\sigma) \ne M(\eta)$. The mind change complexity of $M$ is undefined
if $(Z, R)$ is not well founded, and equal to the length of $(Z, R)$ otherwise.

We will often write ‘DC complexity’ for ‘data consumption complexity,’ and ‘MC
complexity’ for ‘mind change complexity.’ If $M$ Ex-identifies $\mathcal{L}$ and if the mind
change complexity of $M$ is equal to ordinal $\alpha$, then $M$ Ex-identifies $\mathcal{L}$ with
fewer than $\alpha$ mind changes, that is, with at most $\beta$ mind changes if $\alpha$ is of the
form $\beta + 1$ and with no mind change if $\alpha = 1$. Note that if $\alpha$ is a limit ordinal,
then Ex-identifying $\mathcal{L}$ with fewer than $\alpha$ mind changes is not expressible as
Ex-identifying $\mathcal{L}$ with at most $\beta$ mind changes for some $\beta$.

¹ Assuming that $(Z, R)$ is well founded, and denoting by $\rho_R$ the unique function from
$Z$ into the class of ordinals such that $\rho_R(x) = \sup\{\rho_R(y) + 1 : y \in Z, R(y, x)\}$ for
all $x \in Z$, the length of $(Z, R)$ is defined as the least ordinal not in the range of $\rho_R$.

Remark 4. In the definitions above, and in case $(Z, R)$ is well founded, one can
furthermore define the effective length of $(Z, R)$ as the least ordinal $\alpha$ such that
there exists a recursive well-ordering $\sqsubseteq$ of $Z$ that is isomorphic to $\{\beta : \beta < \alpha\}$,
with $\sigma \sqsubseteq \tau$ whenever $R(\sigma, \tau)$; note that $\sigma$, $\tau$ might be equivalent with respect
to $\sqsubseteq$ if they are not strictly ordered by $R$. Since $Z$ is a subset of a recursively
enumerable tree and $R$ is compatible with $\{(\sigma, \tau) : \sigma, \tau \in (\mathcal{D} \cup \{\#\})^*, \sigma \supset \tau\}$, the
restriction of the Kleene-Brouwer ordering of a tree is always such an ordering,
and so the effective length of $(Z, R)$ is defined whenever the length of $(Z, R)$ is
defined. The effective length of $(Z, R)$ yields the notions of effective data
consumption complexity and mind change complexity, in the respective definitions.
The effective mind change complexity is equivalent to the traditional one
[1,7,12,10,20] defined by ordinal counters; the noneffective version is used in
our previous studies of the noneffective setting [17,18]. In [1] it is observed
that any learner which makes on all texts only finitely many mind changes has
an effective mind change complexity, that is, the effective mind change
complexity is defined whenever the noneffective one is defined. The results in
this paper hold for both the effective and the noneffective versions of mind
change and data consumption complexities. Note that both the notions of data
consumption complexity and of mind change complexity are defined ‘externally,’
without requiring the learner to explicitly update an ordinal counter, hence
avoiding the issues related to programs for the ordinals. More precisely, these
externally defined notions provide a lower bound on the ordinal given by a
learner that explicitly updates an ordinal counter, using a particular program
for the ordinals.



Definition 5. A learner whose MC complexity is defined is said to be confident.

A learner whose DC complexity is defined is said to be frugal.

In the literature, a confident learner is usually defined as a learner that
converges on any text, whether this text is or is not for a member of the class $\mathcal{L}$
of languages to identify in the limit. We use the term ‘confident’ in the definition
above relying on the fact (see [1]) that Ex-identification with a bounded number
of mind changes is equivalent to Ex-identification by a confident learner.



3 DC Complexity Versus MC Complexity 

We first state a few obvious relationships. 

Property 6. If the DC complexity of a learner $M$ is defined and equal to ordinal
$\alpha$, then the MC complexity of $M$ is defined and at most equal to $\alpha + 1$.

Remember the definition of an iterative learner:

Definition 7. A learner $M$ is said to be iterative iff it is total and there exists
a total recursive function $g : \mathbb{N} \times (\mathcal{D} \cup \{\#\}) \to \mathbb{N}$ such that for all $\sigma \in (\mathcal{D} \cup \{\#\})^*$
and $x \in \mathcal{D} \cup \{\#\}$, $M(\sigma \star x) = g(M(\sigma), x)$.

Coding data into hypotheses immediately yields the following properties.

Property 8. Let a nonnull ordinal $\alpha$ be given. If some iterative learner whose
MC complexity is equal to $\alpha$ Ex-identifies $\mathcal{L}$, then some learner whose DC
complexity is at most equal to $\alpha$ Ex-identifies $\mathcal{L}$.

Property 9. Let an ordinal $\alpha$ be given. If some learner whose DC complexity
is equal to $\alpha$ Ex-identifies $\mathcal{L}$, then some iterative learner whose MC complexity
is at most equal to $\alpha + 1$ Ex-identifies $\mathcal{L}$.

The notion of data consumption complexity does not collapse to any ordinal
smaller than the first nonrecursive ordinal:

Proposition 10. For any recursive ordinal $\alpha$, $\mathcal{L}$ can be chosen such that:

— $\alpha$ is the least ordinal such that some learner whose DC complexity is equal
to $\alpha$ Ex-identifies $\mathcal{L}$;

— $\alpha$ is the least ordinal such that some learner whose MC complexity is equal
to $\alpha$ Ex-identifies $\mathcal{L}$;

— some learner whose DC and MC complexities are equal to $\alpha$ Ex-identifies $\mathcal{L}$.

Proof. It suffices to define $\mathcal{L}$ as the class of all sets of the form $\{\gamma < \alpha : \gamma \ge \beta\}$ for
$\beta$ ranging over the set of ordinals smaller than a given nonnull ordinal $\alpha$. (What
is really considered here is a class of sets of codes of ordinals, for an appropriate
coding of the ordinals smaller than $\alpha$. We do not make the coding explicit in order
not to clutter the argument.) The optimal learner memorizes a datum iff it is an
ordinal $\beta$ smaller than all ordinals seen so far; the conjecture is $\{\gamma < \alpha : \gamma \ge \beta\}$.

□ 

The next proposition captures the fundamental idea that in some sense, a
learner can be better and better off in terms of frugal use of data if it is
willing to accept higher and higher degrees of uncertainty:

Proposition 11. For all nonnull $k \in \mathbb{N}$, $\mathcal{L}$ can be chosen such that:

— for all $h < k$, some learner whose DC is $\omega \times (k - h)$ Ex-identifies $\mathcal{L}$ with at
most $h$ mind changes, and no learner whose DC is smaller than $\omega \times (k - h)$
Ex-identifies $\mathcal{L}$ with at most $h$ mind changes; in particular, some learner
whose DC is $\omega \times k$ Ex-identifies $\mathcal{L}$ with no mind change, and no learner
whose DC is smaller than $\omega \times k$ Ex-identifies $\mathcal{L}$ with no mind change;

— some learner whose DC is $2k$ Ex-identifies $\mathcal{L}$ with at most $k$ mind changes,
and no learner whose DC is smaller than $2k$ Ex-identifies $\mathcal{L}$.

Proof. For all $n \in \mathbb{N}$, let $I_n$ and $J_n$ be the two subsets of $\mathbb{N}$ of cardinality
$n + 3$ such that $\{I_n : n \in \mathbb{N}\} \cup \{J_n : n \in \mathbb{N}\}$ is a partition of $\mathbb{N}$ and for
all $n \in \mathbb{N}$, the members of $I_n$ are smaller than the members of $J_n$, and the
members of $J_n$ are smaller than the members of $I_{n+1}$. Hence $I_0 = \{0, 1, 2\}$,
$J_0 = \{3, 4, 5\}$, $I_1 = \{6, 7, 8, 9\}$, $J_1 = \{10, 11, 12, 13\}$, etc. Let a nonnull $k \in \mathbb{N}$ be
given. Define $\mathcal{L}$ to be the class of all unions of $k$ disjoint sets of the form $I_n$ or
$(I_n \setminus \{\min(I_n) + m\}) \cup \{\min(J_n) + m\}$ with $n \in \mathbb{N}$ and $m < n + 3$. For instance,
if $k = 1$ then $\mathcal{L}$ contains $\{0, 1, 2\}$, $\{3, 1, 2\}$, $\{0, 4, 2\}$ and $\{0, 1, 5\}$.
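
The blocks $I_n$, $J_n$ and the building pieces of the class can be generated mechanically, which makes the instance for $k = 1$ easy to check. The sketch below is illustrative only; the function names are ours, and the closed form $n^2 + 5n$ for $\min(I_n)$ is obtained by summing the sizes $2(j + 3)$ of the earlier pairs of blocks.

```python
from typing import FrozenSet, List


def block_I(n: int) -> List[int]:
    """I_n: the n + 3 consecutive numbers starting at n^2 + 5n."""
    start = n * n + 5 * n
    return list(range(start, start + n + 3))


def block_J(n: int) -> List[int]:
    """J_n: the n + 3 consecutive numbers right after I_n."""
    start = n * n + 5 * n + (n + 3)
    return list(range(start, start + n + 3))


def variants(n: int) -> List[FrozenSet[int]]:
    """I_n, then the n + 3 sets (I_n minus {min(I_n)+m}) union {min(J_n)+m}."""
    i_n, j_n = block_I(n), block_J(n)
    result = [frozenset(i_n)]
    for m in range(n + 3):
        result.append((frozenset(i_n) - {i_n[m]}) | {j_n[m]})
    return result


print(block_I(1), block_J(1))
for piece in variants(0):
    print(sorted(piece))
```

For $k = 1$ the class consists of the sets `variants(n)` for all $n$; the four sets printed for $n = 0$ are exactly the ones listed in the proof.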

Note that if a learner $M$ gets a datum $x$ from a member $L$ of $\mathcal{L}$, then $M$
can determine the unique $n \in \mathbb{N}$ such that $x \in I_n \cup J_n$, and $M$ knows that
precisely $n + 3$ members of $L \cap (I_n \cup J_n)$ will eventually occur in the text $t$ for
$L$ it is presented with. Since $k$ is fixed, it follows that $M$ can Ex-identify $\mathcal{L}$ with
no mind change. Let $L \in \mathcal{L}$ be such that for all $n \in \mathbb{N}$, $J_n \cap L = \emptyset$. Let $X$
be the set of all $\sigma \in (\mathcal{D} \cup \{\#\})^*$ that are consistent in $\mathcal{L}$. Let $Y$ be the set of
all $\sigma \in X$ such that for all $n \in \mathbb{N}$ and $m < n + 3$, if $\min(J_n) + m$ occurs in
$\mathrm{cnt}(\sigma)$ then all members of $I_n \setminus \{\min(I_n) + m\}$ occur in $\mathrm{cnt}(\sigma)$ before the first
occurrence of $\min(J_n) + m$ in $\mathrm{cnt}(\sigma)$. Given $Z \in \{X, Y\}$, let $h_Z$ be the (unique)
memory function such that for all $\sigma \in (\mathcal{D} \cup \{\#\})^*$, $h_Z(\sigma) = \mathrm{cnt}(\sigma)$ if $\sigma \in Z$, and
$h_Z(\sigma)$ is undefined otherwise. It is immediately verified that both $h_X$ and $h_Y$
have ordinal complexity equal to $\omega \times k$. Finally, if a learner $M$ having $h$-memory
Ex-identifies $\mathcal{L}$ with no mind change, then $h$ necessarily extends $h_Y$, because for
all $\sigma \in Y$, $M$ has to remember all members of $\mathrm{cnt}(\sigma)$. We conclude that some
learner whose DC is $\omega \times k$ Ex-identifies $\mathcal{L}$ with no mind change, but no learner
whose DC is smaller than $\omega \times k$ Ex-identifies $\mathcal{L}$ with no mind change.

Let a learner $M$ have the following property when processing a text $t$ for a
member $L$ of $\mathcal{L}$. For all $n \in \mathbb{N}$, if $I_n \cap L$ is nonempty then $M$ remembers (1) the
first member of $I_n \cup J_n$ that occurs in $t$ and (2) the (unique) member of $J_n$ that
occurs in $t$, if such is the case. Obviously, $M$ can have a data complexity of $2k$.
Moreover, $M$ can Ex-identify $\mathcal{L}$ with at most $k$ mind changes: it suffices to make
the first guess when data from $k$ sets of the form $I_n \cup J_n$ have been observed,
conjecturing that $L$ contains $I_n$ if some member of $I_n$ but no member of $J_n$ has
been observed, and conjecturing that $(I_n \setminus \{\min(I_n) + m\}) \cup \{\min(J_n) + m\}$ is
included in $L$ if $\min(J_n) + m$ has been observed. Clearly, $M$ does not Ex-identify
$\mathcal{L}$ with fewer than $k$ mind changes. Finally, it is easy to verify that a learner
whose data complexity is smaller than $2k$ cannot Ex-identify $\mathcal{L}$.

The remaining part of the proposition is proved using a combination of the
previous arguments. □
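
The learner that stores at most $2k$ data can be simulated on finite prefixes of texts. The following sketch is one possible reading of the strategy described in the proof, under our own assumptions: pauses are represented by `None`, hypotheses are returned as finite sets rather than r.e. indices, no conjecture is issued before $k$ blocks have been observed, and all names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set, Tuple


def locate(x: int) -> Tuple[int, bool, int]:
    """Return (n, in_J, m): x = min(J_n) + m if in_J, else x = min(I_n) + m."""
    n = 0
    while x >= (n + 1) ** 2 + 5 * (n + 1):  # min(I_{n+1}) = (n+1)^2 + 5(n+1)
        n += 1
    offset = x - (n * n + 5 * n)
    if offset < n + 3:
        return n, False, offset
    return n, True, offset - (n + 3)


def conjecture(blocks: Set[int], swapped: Dict[int, int]) -> FrozenSet[int]:
    """Union over the observed blocks of I_n, or of its variant if J_n was hit."""
    result: Set[int] = set()
    for n in blocks:
        start = n * n + 5 * n
        members = set(range(start, start + n + 3))  # I_n
        if n in swapped:
            m = swapped[n]
            members.discard(start + m)           # drop min(I_n) + m
            members.add(start + (n + 3) + m)     # add  min(J_n) + m
        result |= members
    return frozenset(result)


def run_learner(
    text: Iterable[Optional[int]], k: int
) -> Tuple[List[int], List[FrozenSet[int]]]:
    """Return the remembered data and the successive distinct conjectures."""
    memory: List[int] = []
    blocks: Set[int] = set()       # indices n with a remembered datum
    swapped: Dict[int, int] = {}   # n -> m for a remembered min(J_n) + m
    conjectures: List[FrozenSet[int]] = []
    for x in text:
        if x is None:
            continue
        n, in_j, m = locate(x)
        if n not in blocks:
            blocks.add(n)
            memory.append(x)
            if in_j:
                swapped[n] = m
        elif in_j and n not in swapped:
            swapped[n] = m
            memory.append(x)
        else:
            continue  # datum not remembered, so the conjecture cannot change
        if len(blocks) == k:
            hypothesis = conjecture(blocks, swapped)
            if not conjectures or conjectures[-1] != hypothesis:
                conjectures.append(hypothesis)
    return memory, conjectures


memory, guesses = run_learner([1, 2, None, 3], k=1)
print(memory, [sorted(g) for g in guesses])
```

On the prefix `1, 2, #, 3` of a text for $\{3, 1, 2\}$ the learner stores two data and changes its mind once, from $I_0$ to $(I_0 \setminus \{0\}) \cup \{3\}$; the conjecture depends only on the stored data, as required of a learner with $h$-memory.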

The notion of data consumption complexity is very restrictive, but becomes
more general when learning from fat texts:

Proposition 12. It is possible to choose $\mathcal{L}$ such that:

— some learner whose DC complexity is $\omega$ Ex-identifies $\mathcal{L}$ from fat texts;

— no frugal learner Ex-identifies $\mathcal{L}$ from texts.

Proof. Suppose that $\mathcal{L}$ consists of the set $2\mathbb{N}$ of even numbers and every set of
the form $D \cup \{2x + 1\}$ where $|D| = x$ and $D \subset 2\mathbb{N}$.

Assume that the learner sees $\sigma \in (2\mathbb{N} \cup \{\#\})^*$. Let $y = 2|\mathrm{cnt}(\sigma)| + 1$. The
learner has to memorize $\mathrm{cnt}(\sigma)$ since it cannot exclude that the set to be learned
is $\mathrm{cnt}(\sigma) \cup \{y\}$ and the current text is $\sigma \star y \star \#^\infty$. Thus the learner cannot be
frugal.

A learner that learns from fat text can have a memory function $h$ that
extracts from any $\sigma$ the first odd number $2x + 1$, if there is any, and then the
first occurrences of up to $x$ many even numbers after the first occurrence of
$2x + 1$. The hypothesis of the learner on input $\sigma$ is $2\mathbb{N}$ if $h(\sigma) = ()$ and $\mathrm{cnt}(h(\sigma))$
otherwise. The learner succeeds because fat texts allow it to delay reading
the members of $D$ until after it has seen $2x + 1$, whenever the set to be learnt is
of the form $D \cup \{2x + 1\}$. □
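
The memory function used for fat texts is simple enough to be written out. The sketch below is an illustration under our own assumptions: `None` stands for the pause symbol, the string `"2N"` stands for the hypothesis "all even numbers", finite hypotheses are returned as sets rather than r.e. indices, and the names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence, Union

EVENS = "2N"  # stands for the set of all even numbers


def memory_function(sigma: Sequence[Optional[int]]) -> List[int]:
    """First odd number 2x+1, then the first occurrences of up to x even
    numbers that appear after it."""
    memory: List[int] = []
    quota = 0
    for datum in sigma:
        if datum is None:
            continue
        if not memory:
            if datum % 2 == 1:  # first odd number seen
                memory.append(datum)
                quota = (datum - 1) // 2
        elif datum % 2 == 0 and datum not in memory and len(memory) - 1 < quota:
            memory.append(datum)
    return memory


def hypothesis(sigma: Sequence[Optional[int]]) -> Union[str, FrozenSet[int]]:
    """2N while nothing is remembered, otherwise the content of the memory."""
    memory = memory_function(sigma)
    return EVENS if not memory else frozenset(memory)


prefix = [2, 4, 5, 2, 4, 5, 2]  # prefix of a fat text for {2, 4, 5}
print(memory_function(prefix), hypothesis(prefix))
```

The even numbers 2 and 4 are ignored on their first occurrence, before 5 has been seen, and picked up on their second occurrence. This is where fatness of the text is used: on an arbitrary text they might never reappear.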

The next proposition gives other relationships between data consumption 
complexity, mind change complexity, and learning from fat texts. 

Proposition 13. For all recursive ordinals $\alpha$, $\mathcal{L}$ can be chosen such that:

— some learner whose DC and MC complexities are equal to $\alpha$ Ex-identifies $\mathcal{L}$;

— some nonfrugal learner whose MC complexity is equal to 1 Ex-identifies $\mathcal{L}$;

— some learner whose DC complexity is equal to $\alpha + 1$ and whose MC complexity
is equal to 1 Ex-identifies $\mathcal{L}$ on fat texts;

— no frugal learner whose MC complexity is smaller than $\alpha$ Ex-identifies $\mathcal{L}$.

Proof. Let a recursive ordinal $\alpha$ be given. Suppose that $\mathcal{L}$ consists of the
following, for any $n > 0$, nonempty decreasing sequence $(\alpha_1, \dots, \alpha_n)$ of ordinals
smaller than $\alpha$, and increasing sequence $(q_1, \dots, q_n)$ of rational numbers:

— all nonempty initial segments of $((\alpha_1, q_1), \dots, (\alpha_n, q_n))$;

— all rational numbers at least equal to $q_n$.

A learner that only remembers sequences of the form $((\alpha_1, q_1), \dots, (\alpha_n, q_n))$ and
outputs an index for $\{((\alpha_1, q_1), \dots, (\alpha_n, q_n))\} \cup \{q \in \mathbb{Q} : q \ge q_n\}$ for the longest
such sequence, will clearly have data consumption complexity equal to $\alpha$, and
can Ex-identify $\mathcal{L}$ with fewer than $\alpha$ mind changes, but cannot do any better.

A learner that only remembers sequences of the form $((\alpha_1, q_1), \dots, (\alpha_n, q_n))$
and rational numbers smaller than any rational number remembered before will
wait till both $(\alpha', q)$ and $q$ are observed for some $\alpha'$ and $q$, thanks to which it
can Ex-identify $\mathcal{L}$ with no mind change. On the other hand, considering texts
that start with decreasing sequences of rational numbers that are long enough,
it is easily verified that this learner has undefined data consumption complexity.

On fat texts, a learner can Ex-identify $\mathcal{L}$ with no mind change by only
remembering sequences of the form $((\alpha_1, q_1), \dots, (\alpha_n, q_n))$, followed by a rational
number that also occurs last in the (necessarily) longest remembered sequence
of the previous form. Such a learner will clearly have DC complexity equal to
$\alpha + 1$.

The last claim of the proposition is easily verified. □

When $\mathcal{L}$ is an indexed family, it is sometimes essential to have a partial
memory function as opposed to a total one:

Proposition 14. It is possible to choose $\mathcal{L}$ such that:

— $\mathcal{L}$ is an indexed family;

— there exists a frugal learner that Ex-identifies $\mathcal{L}$;

— for all total computable memory functions $h$ and frugal learners $M$, if $M$
has $h$-memory then $M$ does not Ex-identify $\mathcal{L}$.

Proof. Define $\mathcal{L}$ to be the set of all sets of the form $\{(x, 0), \dots, (x, y)\}$ where
$x, y \in \mathbb{N}$ and there exists $e \le x$ with $\varphi_e(x)$ being defined and at least equal to $y$.
Clearly, $\mathcal{L}$ is an indexed family. Define a learner $M$ as follows. Suppose that $M$
has memorized $((x, y_1), \dots, (x, y_k))$ for some $k, y_1, \dots, y_k \in \mathbb{N}$ with $y_1 < \dots < y_k$
and is presented with $(x, y)$. If $y \le y_k$ then $M$ discards $(x, y)$. If $y > y_k$ then $M$
tries to compute $\varphi_0(x), \dots, \varphi_x(x)$. In case one computation eventually converges
and its output is at least equal to $y$ then $M$ remembers $(x, y)$; otherwise $M$
discards $(x, y)$. The hypothesis output by $M$ is $\{(x, 0), \dots, (x, y)\}$ with $(x, y)$
being the pair remembered last. It is easily verified that $M$ has data consumption
complexity bounded by $\omega \times \omega$, and that $M$ Ex-identifies $\mathcal{L}$.

Let a total memory function $h$ and a frugal learner $M$ be such that $M$ has
$h$-memory. For all $x \in \mathbb{N}$, let $\tilde{h}(x)$ be the least $y > 0$ such that

$h((x, 0), \dots, (x, y)) = ((x, 0), \dots, (x, y - 1))$.

Let $e$ be an index for $\tilde{h}$. Put:

$L_1 = \{(e, 0), \dots, (e, \tilde{h}(e))\}$ and $L_2 = \{(e, 0), \dots, (e, \tilde{h}(e) - 1)\}$.

Then $L_1$ and $L_2$ both belong to $\mathcal{L}$. But it follows immediately from the definition
of $\tilde{h}(x)$ that $M$ will not correctly converge on both $(e, 0), \dots, (e, \tilde{h}(e)), \#, \#, \dots$
(text for $L_1$) and $(e, 0), \dots, (e, \tilde{h}(e) - 1), \#, \#, \dots$ (text for $L_2$). □



4 Relationships with Some Restricted Learners 

In this section, we will investigate how some of the usual restrictions on learn- 
ers [19] affect the notion of data consumption complexity. Whether it can be 
enforced that every hypothesis is consistent with the data seen so far (consis- 
tency), whether it can be enforced that every hypothesis extends the previous 
hypothesis (monotonicity), whether it can be enforced that every hypothesis 
equals the previous hypothesis H in case all data seen so far are consistent 
with H (conservativeness), are among the questions that have received a lot of 
attention, see for example [2,11,13,16,19,21]. 

Definition 15. A learner $M$ is said to be consistent iff for all $L \in \mathcal{L}$ and
$\sigma \in (L \cup \{\#\})^*$, $M(\sigma)$ is the index of a set that contains all data that occur in $\sigma$.

Definition 16. A learner $M$ is said to be conservative iff for all $\sigma \in (\mathcal{D} \cup \{\#\})^*$
and members $x$ of $\mathcal{D} \cup \{\#\}$, if $M(\sigma)$ is defined and is the index of a set that
contains $x$ then $M(\sigma \star x)$ is defined and equal to $M(\sigma)$.

Definition 17. Let a learner $M$ be given.

(a) $M$ is said to be strong-monotonic iff for all $L \in \mathcal{L}$, texts $t$ for $L$ and $i, j \in \mathbb{N}$
with $i < j$, if $M(t[i])$ and $M(t[j])$ are both defined then $W_{M(t[i])} \subseteq W_{M(t[j])}$.

(b) $M$ is said to be monotonic iff for all $L \in \mathcal{L}$, texts $t$ for $L$ and $i, j \in \mathbb{N}$ with
$i < j$, if $M(t[i])$ and $M(t[j])$ are both defined then $W_{M(t[i])} \cap L \subseteq W_{M(t[j])} \cap L$.

Definition 18. A learner $M$ is said to be set-driven iff for all $\sigma, \tau \in (\mathcal{D} \cup \{\#\})^*$,
if $\mathrm{cnt}(\sigma)$ is equal to $\mathrm{cnt}(\tau)$ then $M(\sigma) = M(\tau)$. We write $M(D)$ for $M(\sigma)$, for
any member $\sigma$ of $(\mathcal{D} \cup \{\#\})^*$ with $\mathrm{cnt}(\sigma) = D$.



Property 19. A set-driven learner $M$ Ex-identifies $\mathcal{L}$ iff there is a recursive
function $f$ mapping (canonical indices of) finite sets to r.e. indices such that for
all $L \in \mathcal{L}$, there exists a finite subset $D$ of $L$ such that for all finite subsets $E$ of
$L$, if $D \subseteq E$ then $f(E) = f(D)$ and $f(D)$ is an r.e. index of $L$.

Monotonicity and strong monotonicity cannot be enforced by bounds on the data
consumption complexity beyond the trivial bounds 0 for strong monotonicity
and 1 for monotonicity (the trivial bounds are obtained from the fact that mind
change complexity 1 enforces strong monotonicity and 2 enforces monotonicity):

Example 20. If $\mathcal{L}$ consists of $2\mathbb{N}$ and, for all $x \in \mathbb{N}$, the finite sets

$\{0, 2, \dots, 2x\} \cup \{2x + 1\}$ and $\{0, 2, \dots, 2x\} \cup \{2x + 1, 2x + 2, 2x + 3\}$,

then some learner whose DC complexity is 2 Ex-identifies $\mathcal{L}$, but no monotonic
learner Ex-identifies $\mathcal{L}$.

Consistency, strong-monotonicity, and confidence guarantee that DC complexity
is defined:

Proposition 21. If some consistent, strong-monotonic and confident learner
Ex-identifies $\mathcal{L}$ then some frugal learner Ex-identifies $\mathcal{L}$.

Proof. Let $N$ be a consistent, strong-monotonic and confident learner that
Ex-identifies $\mathcal{L}$. We define a memory function $h$ and a learner $M$ having $h$-memory
as follows. Put $M(()) = N(())$. Let $\sigma \in (\mathcal{D} \cup \{\#\})^*$ and $x \in \mathcal{D} \cup \{\#\}$ be given.
The definition of $h(\sigma \star x)$ and $M(\sigma \star x)$ is by cases.

Case 1: $N(\tau)$ is defined for all initial segments $\tau$ of $h(\sigma) \star x$ and $N(h(\sigma) \star x)$
is equal to $N(h(\sigma))$. Then $h(\sigma \star x) = h(\sigma)$.

Case 2: $N(\tau)$ is defined for all initial segments $\tau$ of $h(\sigma) \star x$ but $N(h(\sigma) \star x)$
is distinct from $N(h(\sigma))$. Then $h(\sigma \star x) = h(\sigma) \star x$ and $M(\sigma \star x) = N(h(\sigma \star x))$.

Case 3: Otherwise, both $h(\sigma \star x)$ and $M(\sigma \star x)$ are undefined.

Note that Case 3 happens if and only if $\sigma \star x$ is not consistent in $\mathcal{L}$. Let $t$
be an infinite enumeration of members of $\mathcal{D} \cup \{\#\}$ such that for all $i \in \mathbb{N}$, $t[i]$ is
consistent in $\mathcal{L}$ and assume that […] is defined. Then $M(t[i])$ is defined for
all $i \in \mathbb{N}$, and since the mind change complexity of $N$ is defined, $\{h(t[i]) : i \in \mathbb{N}\}$
is finite. Hence the mind change complexity of $M$ is also defined.

Using the facts that $N$ is consistent and strong-monotonic, and Ex-identifies
$\mathcal{L}$, it is easily verified that:

— $M$ itself is consistent and strong-monotonic;

— for all $L \in \mathcal{L}$, texts $t$ for $L$ and $i \in \mathbb{N}$, $M(t[i]) \subseteq L$.

We infer that $M$ converges on any text for $L$, for any member $L$ of $\mathcal{L}$. □

If the strong-monotonicity requirement is lifted, then there is no longer any
guarantee that the data consumption complexity is defined:

Proposition 22. It is possible to choose $\mathcal{L}$ such that:

— some learner whose MC complexity is equal to 1 Ex-identifies $\mathcal{L}$;

— some confident, consistent, monotonic learner Ex-identifies $\mathcal{L}$;

— some confident, consistent, conservative learner Ex-identifies $\mathcal{L}$;

— no frugal learner Ex-identifies $\mathcal{L}$.

Proof. Let $\mathcal{L}$ consist of the set $2\mathbb{N}$ of even numbers and all sets of the form
$D \cup \{2x + 1\}$ where $D \subset \{2, 4, \dots\}$ and $|D| = x$. A learner that waits until either
0 or a number of the form $2x + 1$ together with $x$ members of $\{2, 4, \dots\}$ have
appeared can obviously Ex-identify $\mathcal{L}$ with no mind change.

Define a learner $M$ as follows. Let $\sigma \in (\mathcal{D} \cup \{\#\})^*$ be given and choose the
appropriate case.

Case 1: If all members of $\mathrm{cnt}(\sigma)$ are even then $M(\sigma) = \{0, 2, 4, \dots\}$.

Case 2: If $\mathrm{cnt}(\sigma)$ contains a unique number of the form $2x + 1$ and less than
$x$ even numbers then $M(\sigma) = \{2x + 1, 0, 2, 4, \dots\}$.

Case 3: If $\mathrm{cnt}(\sigma)$ contains a unique number of the form $2x + 1$ and precisely
$x$ even numbers then $M(\sigma) = \mathrm{cnt}(\sigma)$.

Case 4: Otherwise $M(\sigma)$ is undefined.

Note that:

— for all texts $t$ for $\{0, 2, 4, \dots\}$ and $i \in \mathbb{N}$, $M(t[i]) = \{0, 2, 4, \dots\}$;

— for all members $L$ of $\mathcal{L}$ of the form $L = \{2x + 1, y_1, y_2, \dots, y_x\}$ and for all
texts $t$ for $L$, $M$ will output in response to longer and longer initial segments
of $t$ only the hypotheses $\{0, 2, 4, \dots\}$, $\{2x + 1, 0, 2, 4, \dots\}$ and $L$ (maybe not
all of them), in that order, and $L \cap \{0, 2, 4, \dots\} \subseteq L \cap \{2x + 1, 0, 2, 4, \dots\} \subseteq L$.

Hence $M$ is monotonic, and it is easily verified that $M$ is consistent, has a defined
mind change complexity, and Ex-identifies $\mathcal{L}$.

Define a learner $N$ as follows. Let $\sigma \in (\mathcal{D} \cup \{\#\})^*$ be given. The definition of
$N(\sigma)$ is by cases.

Case 1: If all members of $\mathrm{cnt}(\sigma)$ are even then $N(\sigma) = \{0, 2, 4, \dots\}$.

Case 2: If $\mathrm{cnt}(\sigma)$ contains a unique number of the form $2x + 1$ and at most
$x$ even numbers then $N(\sigma) = \mathrm{cnt}(\sigma)$.

Case 3: Otherwise $N(\sigma)$ is undefined.

It is immediately verified that $N$ is a consistent and conservative learner that
Ex-identifies $\mathcal{L}$, and whose mind change complexity is defined and equal to $\omega + 1$.
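
The case distinctions defining $M$ and $N$ translate directly into code, which makes their behaviour on a given text easy to trace. The sketch below is illustrative and rests on our own representation choices: `None` is the pause symbol, the string `"evens"` stands for $\{0, 2, 4, \dots\}$, the pair `("evens+", 2x + 1)` stands for $\{2x + 1, 0, 2, 4, \dots\}$, finite hypotheses are given as sets, and a return value of `None` means "undefined".

```python
from typing import List, Optional, Sequence, Set, Tuple


def split(sigma: Sequence[Optional[int]]) -> Tuple[Set[int], List[int], Set[int]]:
    """Return the content of sigma, its odd members (sorted) and its even members."""
    content = {d for d in sigma if d is not None}
    odds = sorted(d for d in content if d % 2 == 1)
    evens = {d for d in content if d % 2 == 0}
    return content, odds, evens


def learner_M(sigma: Sequence[Optional[int]]):
    content, odds, evens = split(sigma)
    if not odds:
        return "evens"                      # Case 1
    if len(odds) == 1:
        x = (odds[0] - 1) // 2
        if len(evens) < x:
            return ("evens+", odds[0])      # Case 2
        if len(evens) == x:
            return frozenset(content)       # Case 3
    return None                             # Case 4: undefined


def learner_N(sigma: Sequence[Optional[int]]):
    content, odds, evens = split(sigma)
    if not odds:
        return "evens"                      # Case 1
    if len(odds) == 1 and len(evens) <= (odds[0] - 1) // 2:
        return frozenset(content)           # Case 2
    return None                             # Case 3: undefined


text = [2, 5, None, 4]  # prefix of a text for {5, 2, 4}, where x = 2
for i in range(1, len(text) + 1):
    print(learner_M(text[:i]), learner_N(text[:i]))
```

On this prefix $M$ passes through the three hypotheses named in the proof, in order, while $N$ changes its conjecture each time a new datum enlarges the content, which is why its number of mind changes depends on $x$.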

Let $P$ be a learner that has $h$-memory, for some memory function $h$ whose
ordinal complexity is defined. Let a sequence $(\sigma_n)_{n \in \mathbb{N}}$ of finite sequences be defined
as follows. Put $\sigma_0 = ()$. Let a nonnull $n \in \mathbb{N}$ be given, and assume that $\sigma_m$ has
been defined for all $m < n$, but $\sigma_n$ has not been defined yet. If there exists a
sequence $\tau$ consisting of two members of $\{6n, 6n + 2, 6n + 4\}$ such that $h(\sigma_{n-1} \star \tau)$
is distinct from $h(\sigma_{n-1})$ then $\sigma_n = \sigma_{n-1} \star \tau$ for such a sequence $\tau$. Otherwise
$\sigma_m = \sigma_{n-1}$ for all $m \ge n$. Since the ordinal complexity of $h$ is defined, there exists
a least $p \in \mathbb{N}$ with $\sigma_{p-1} = \sigma_p$. Note that $\sigma_p$ contains precisely $2p$ even numbers.
Then $e_1 = \sigma_p \star (6p, 4p + 1, \#, \#, \dots)$ and $e_2 = \sigma_p \star (6p + 2, 4p + 1, \#, \#, \dots)$ are two
texts for two distinct members of $\mathcal{L}$ but for all $m > 2p$, $P(e_1[m]) = P(e_2[m])$,
hence $P$ fails to Ex-identify $\mathcal{L}$. □



5 Relationships with Some Learning Criteria

Frugality turns out to be related to learnability from good examples. There are 
two main notions of learnability from good examples [5,9,15]: 

Definition 23. $\mathcal{L}$ is said to be class-comprisingly learnable from good examples
iff there is a numbering $H_0, H_1, \dots$ of uniformly recursively enumerable sets
containing all members of $\mathcal{L}$, a recursive function $g$ mapping r.e. indices to canonical
indices of finite sets and a set-driven learner $M$ mapping canonical indices of
finite sets to r.e. indices such that for all $e \in \mathbb{N}$:

— if $H_e \in \mathcal{L}$ then $D_{g(e)} \subseteq H_e$;

— if $H_e \in \mathcal{L}$, $D_{g(e)} \subseteq E \subseteq H_e$ and $E$ is finite then $H_{M(E)} = H_e$.

Moreover, if $\mathcal{L} = \{H_0, H_1, \dots\}$ then $\mathcal{L}$ is said to be class-preservingly learnable
from good examples.

Example 24. [9, Theorem 1(b)] It is possible to choose $\mathcal{L}$ such that $\mathcal{L}$ is
confidently learnable, but not learnable from good examples. An example of such a
class consists of the following subsets of $\mathbb{N} \times \mathbb{N}$: $\{(i, 0), (i, 1), \dots\}$ for all $i$ and, for
a given recursive repetition-free enumeration $j_0, j_1, \dots$ of the halting problem,
each set $\{(j_i, 0), (j_i, 1), \dots, (j_i, k)\}$ with $k < i$.

So confidently learnable classes may fail to be learnable from good examples, 
but this result does not carry over to frugally learnable classes: 

Proposition 25. Suppose that some frugal learner Ex-identifies if. Then if is 
class- comprisingly learnable from good examples. More precisely, there is a super- 
class of if which is class-preservingly learnable from good examples with respect 
to a repetition-free enumeration of this class. 

Proof. Let a memory function h and a frugal learner M be such that M has h- 
memory and M Ex-identifies if. First one introduces some useful concepts. Let t 
be any recursive sequence over N U {D} that contains infinitely many occurrences 
of each member of N U {]]}. Given F C N, let tp be such that for all i G N, 
tp{i) = t{i) if t{i) G F and tp{i) = j] otherwise. Given F C N, define ap to be 
the shortest initial segment of tp such that h{ap) is defined and for all initial 
segments t of tp that extend tp{i), h{r) is defined and equal to h{ap); if such 
an initial segment does not exist then ap is undefined. Intuitively, this means 
that whenever ap exists, M does not remember any member oitp that occurs 
in tj’ beyond ap, and therefore there isnoa;GFU{^} such that M remembers 
the last occurrence of x in ap * x. Note that the partial function that maps a 
finite subset D of N such that ap exists, to ap itself, is partial recursive. 

We now define the enumeration of the superclass which is class-preservingly 
learnable from good examples. We let be a recursive repetition-free enu- 

meration of all defined sequences ap for finite sets D; Since M is a frugal learner, 
ap is defined at least for all those D which are consistent with L. Having this, 
we define Hi to be the union of cnt (rji) with 

{x G WM(rji) — cnt(? 7 i) : ap exists and ap = ry where D = cnt{r]i) U {a;}} 
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and we verify the following: (a) L C : i g N}; (b) the enumeration 
is repetition-free; (c) {Hi : i G N} is class-preservingly learnable from good 
examples via the function G mapping the set i to the set G{i) = cnt(? 7 i) of good 
examples of Hi and the learner N mapping each finite set E to the index i of the 
string rji equal to ue whenever ge exists; N is undefined if ue does not exist. 

For (a), consider any $L \in \mathcal{L}$ and the sequence $\sigma_L$. Note that
$\sigma_L = \sigma_D$ for the finite set $D = \mathrm{cnt}(\sigma_L)$. Therefore there is an
index $i$ with $\sigma_L = \eta_i$. Since $M$ converges on $t_L$ to an index of $L$ it holds
that $W_{M(\eta_i)} = L$. Every $x \in L$ satisfies that $E = D \cup \{x\}$ is consistent
with $\mathcal{L}$ and thus $\sigma_E$ exists. Furthermore, $\sigma_E = \sigma_D$ since
$D = \mathrm{cnt}(\sigma_L) \subseteq E \subseteq L$. Thus $H_i = L$.

For (b), consider any $i, j$ with $H_i = H_j$. Then $\eta_i$ and $\eta_j$ are both prefixes
of $t_{H_i}$, say $\eta_i \preceq \eta_j$. It follows that $h(\eta_j) = h(\eta_i)$ since
otherwise there is an $x \in \mathrm{cnt}(\eta_j) - \mathrm{cnt}(\eta_i)$ which would then be
in $H_j - H_i$ in contradiction to $H_i = H_j$. Since the enumeration
$(\eta_k)_{k \in \mathbb{N}}$ is repetition-free, $i = j$ and the enumeration
$(H_k)_{k \in \mathbb{N}}$ is also repetition-free.

For (c), consider any $i \in \mathbb{N}$ and $H_i$. It follows directly from the definition
of $H_i$ that $G(i) \subseteq H_i$. Furthermore, every finite $D$ with
$G(i) \subseteq D \subseteq H_i$ satisfies that $\eta_i$ is a prefix of $t_D$ and that
$h(\eta_i x) = h(\eta_i)$ for all $x \in D \cup \{\#\}$. Thus $\sigma_D = \eta_i$ and $N(D)$
is defined and equal to $i$. So $N$ is a learner which infers $\{H_i : i \in \mathbb{N}\}$
class-comprisingly from good examples with respect to the repetition-free enumeration
$(H_i)_{i \in \mathbb{N}}$. □

Note that one can replace every iterative learner for $\mathcal{L}$ by an equivalent
learner $M$ that converges on every text for a finite set. Then $M$ satisfies the property
that $\sigma_D$ exists for every $D$ consistent with $\mathcal{L}$. This property was the
only essential condition used in the proof which follows from the definition of a frugal
learner but not from the definition of an iterative learner. Thus one obtains:



Corollary 26. If $\mathcal{L}$ is iteratively learnable then $\mathcal{L}$ is
class-comprisingly learnable from good examples.

By [9, Theorem 2], there are classes which are iteratively learnable as they
consist of finite sets only, but which are not class-preservingly learnable from
good examples. For an example of such a class, consider the class consisting of
all $\{x\}$ with $x \in K'$ and $\{x, y\}$ with $x < y$; $K'$ is the halting problem for
computations relative to $K$. Still, this class does not have a class-preserving iterative
learner. The next example shows that there is also a class-comprisingly frugally
learnable class which is not class-preservingly learnable from good examples.

Proposition 27. There is a class which can be learned with data consumption 
complexity 3, but which is not class-preservingly learnable from good examples. 

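Proof. If a class $\mathcal{L}$ can be learned class-preservingly from good examples then
there is a total function $\varphi_i$ such that $W_{\varphi_i(0)}, W_{\varphi_i(1)}, \ldots$
is an enumeration of $\mathcal{L}$ (not containing any nonmember of $\mathcal{L}$) and there
is a recursive function $\varphi_j$ mapping every $e$ to the canonical index of a finite
subset of $W_{\varphi_i(e)}$ such that no $L \in \mathcal{L}$ satisfies

$$D_{\varphi_j(e)} \subseteq L \subset W_{\varphi_i(e)}.$$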
The goal is to construct a class $\mathcal{L}$ which violates this constraint. The class
$\mathcal{L}$ is given by an indexed family which consists of $\emptyset$ and the following sets:

— L,j = for all i,j G N; 

~ Fij = .fU Si j + 1)} whenever there is a least pair (eij,Sij) G 

such that ipj{eij) is defined, 

0 tZ F)^^ ^ {(b ? (b 1)5 ■ • ■ 5 (b J? ; 

s-i,j < Sij, and all these facts can be verified in time Sij. If such a pair 
(ei,j,Sij) does not exist then no set Fij is added to 

Note that whenever ... is a class-preserving enumeration of H and 

(pj maps every e G N to the canonical index of a strict finite subset of IT,^.(e), 
then 6ij must exist and p ~ Indeed, p, belonging to F, is 

equal either to Lij or to Fi j. But if p were equal to Fi j then one would 

derive 

HVt(eij) “ H {(b J) SiJ F 1)} C U {(b J, Sij F 1)} = Wlp^(eiJ)^ 

which is impossible. Hence Wip.(^eij) ~ ^i,j- Furthermore, 

Then $\mathcal{L}$ has no class-preserving learner from good examples which uses the
numbering $(W_{\varphi_i(k)})_{k \in \mathbb{N}}$ and the function $\varphi_j$ to compute the
set of good examples from the index since such a learner would have to map $F_{i,j}$ to
$F_{i,j}$ and to $L_{i,j}$ at the same time. So $\mathcal{L}$ is not class-preservingly
learnable from good examples.

A class-preserving learner for $\mathcal{L}$ has data consumption complexity 3. It is
initialized with $\emptyset$ and it memorizes a new datum $(i, j, k)$ iff:

- all previously memorized data $(i', j', k')$ satisfy $i' = i$, $j' = j$ and $k' < k$;

- either $(i, j, k)$ is the first datum to be memorized,
  or the set $F_{i,j}$ exists and $k = s_{i,j} + 1$,
  or the set $F_{i,j}$ exists and the previous hypothesis is $F_{i,j}$ and $k > s_{i,j} + 1$.

Although it is impossible to check whether $F_{i,j}$ exists, one can check whether $F_{i,j}$
exists and $s_{i,j} < k$.

When the learner memorizes a new datum $(i, j, k)$, either $F_{i,j}$ exists and
$s_{i,j} + 1$ is equal to $k$, in which case the learner outputs $F_{i,j}$, or the learner
outputs $L_{i,j}$.

It is easy to see that the learner is correct and memorizes a maximum amount
of data when the hypotheses $L_{i,j}$, $F_{i,j}$, $L_{i,j}$ are output; a new datum is then
memorized for each hypothesis. □
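
The memorization rule of this learner can be sketched in Python. This is only an illustration under stated assumptions: the dictionary `s` is a hypothetical stand-in that stores $s_{i,j}$ for those pairs where $F_{i,j}$ exists, the helper `f_known` models the decidable test "$F_{i,j}$ exists and $s_{i,j} < k$", and hypotheses are represented by the labels `"L_i,j"` and `"F_i,j"` rather than by actual indices.

```python
from typing import Dict, List, Optional, Tuple

Datum = Tuple[int, int, int]


def run_learner(text: List[Datum], s: Dict[Tuple[int, int], int]):
    """Return (memorized data, hypotheses output) for the rule described above.

    `s` is a hypothetical lookup table: s[(i, j)] = s_ij whenever F_ij exists.
    The learner consults it only through `f_known`, i.e. only for the
    decidable question "does F_ij exist with s_ij < k?".
    """
    memory: List[Datum] = []
    hypotheses: List[str] = []

    def f_known(i: int, j: int, k: int) -> Optional[int]:
        s_ij = s.get((i, j))
        return s_ij if s_ij is not None and s_ij < k else None

    for (i, j, k) in text:
        # First condition: same i, j and strictly larger k than all memorized data.
        if not all(i2 == i and j2 == j and k2 < k for (i2, j2, k2) in memory):
            continue
        s_ij = f_known(i, j, k)
        prev = hypotheses[-1] if hypotheses else None
        first = not memory
        hits_f = s_ij is not None and k == s_ij + 1
        passes_f = s_ij is not None and prev == f"F_{i},{j}" and k > s_ij + 1
        # Second condition: one of the three alternatives must hold.
        if not (first or hits_f or passes_f):
            continue
        memory.append((i, j, k))
        hypotheses.append(f"F_{i},{j}" if hits_f else f"L_{i},{j}")
    return memory, hypotheses
```

On a text that presents a small $k$ first, then $s_{i,j} + 1$, then a larger $k$, the sketch memorizes exactly three data and outputs the hypothesis sequence $L_{i,j}$, $F_{i,j}$, $L_{i,j}$, matching the worst case named in the proof.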

6 Conclusion 

Traditional models of computation and learning have looked at intrinsic uncer- 
tainty (for example, conditions under which a class of languages is learnable 
with no less than a given number of mind changes), but do not conceive of 
‘self-imposed’ uncertainty as a desirable feature. These models do not capture 
the behaviour of human beings in most real-life decisions: even when he knows
that the degree of uncertainty will eventually decrease, a decision maker accepts 
the extra uncertainty of the moment and takes action. In this paper we have 
proposed a preliminary attempt to formally justify why uncertainty might be 
desirable. The model is crude, in particular because the notion of data con- 
sumption complexity is very restrictive; it will be generalized in future work. We 
also intend to extend our work from the numerical setting to the logical setting, 
expecting to benefit from the expressive power of the latter. 

Abstract. Different formal learning models address different aspects of
learning. Below we compare learning via queries (interpreting learning
as a one-shot process in which the learner is required to identify the target
concept with just one hypothesis) to Gold-style learning (interpreting
learning as a limiting process in which the learner may change its mind
arbitrarily often before converging to a correct hypothesis).

Although these two approaches seem rather unrelated, a previous study 
has provided characterisations of different models of Gold-style learning 
(learning in the limit, conservative inference, and behaviourally correct 
learning) in terms of query learning. Thus under certain circumstances 
it is possible to replace limit learners by equally powerful one-shot learn- 
ers. Both this previous and the current analysis are valid in the general 
context of learning indexable classes of recursive languages. 

The main purpose of this paper is to solve a challenging open prob- 
lem from the previous study. The solution of this problem leads to an 
important observation, namely that there is a natural query learning 
type hierarchically in-between Gold-style learning in the limit and be- 
haviourally correct learning. Astonishingly, this query learning type can 
then again be characterised in terms of Gold-style inference. 

In connection with this new in-between inference type we have gained 
new insights into the basic model of conservative learning and the way 
conservative learners work. In addition to these results, we compare sev- 
eral further natural inference types in both models to one another. 



1 Introduction 

Undeniably, there is no formal scheme spanning all aspects of human learning. 
Thus each learning model analysed within the scope of learning theory addresses 
only special facets of our understanding of learning. For example, Angluin’s [2,3] 
model of learning with queries focusses on learning as a finite process of interaction
between a learner and a teacher. The learner asks questions of a specified type 
about the target concept and the teacher answers these questions truthfully. 
After finitely many steps of interaction the learner is supposed to return its sole 
hypothesis, correctly describing the target concept. Here the crucial features of
the learner are its ability to demand special information on the target concept
and its restrictiveness in terms of mind changes. Since a query learner is required
to identify the target concept with just a single hypothesis, we refer to this
phenomenon as one-shot learning.^1

In contrast to that, Gold’s [7] model of identification in the limit is con- 
cerned with learning as a limiting process of creating, modifying, and improving 
hypotheses about a target concept. These hypotheses are based upon instances 
of the target concept offered as information. In the limit, the learner is supposed 
to stabilize on a correct guess, but during the learning process one will never 
know whether or not the current hypothesis is already correct. Here the ability 
to change its mind is a crucial feature of the learner. 

[11] is concerned with a first systematic analysis of common features of these 
two seemingly unrelated approaches, thereby focussing on the identification of 
formal languages, ranging over indexable classes of recursive languages, as tar- 
get concepts, see [1,9,14]. Characterising different types of Gold-style language 
learning in terms of query learning has pointed out interesting correspondences 
between the two models. In particular, the results in [11] demonstrate how learn- 
ers identifying languages in the limit can be replaced by one-shot learners without 
loss of learning power. That means, under certain circumstances the capabilities 
of limit learners are equal to those of one-shot learners using queries. 

The analysis summarized in this paper has initially been motivated by an
open problem in [11], namely whether or not the capabilities of query learners
using superset queries in Gödel numberings are equal to those of behaviourally
correct Gold-style learners using Gödel numberings as their hypothesis spaces.
Below we will answer this question in the negative, which will lead to the
following astonishing observation: there is a natural inference type (learning via
superset queries in Gödel numberings) which lies in-between Gold-style learning
in the limit from text and behaviourally correct Gold-style learning from text in
Gödel numberings.^2 Up to now, no such inference type has been known.

This observation immediately raises a second question, namely whether there 
is an analogue of this query learning type in terms of Gold-style learning and 
thus whether there is also a Gold-style inference type between learning in the 
limit and behaviourally correct learning. Indeed such a relation can be observed 
with conservative inference in Gödel numberings by learners using an oracle for
the halting problem; see [13] for further results on learning with oracles. 

Studying such relations between two different approaches to language learn- 
ing allows for transferring theoretically approved insights from one model to the 
other. In particular, our characterisations may serve as ‘interfaces’ between an 
analysis of query learning and an analysis of Gold-style learning through which 
proofs on either model can be simplified using properties of the other. 

^1 Most studies on query learning mainly deal with the efficiency of query learners,
whereas below we are only interested in qualitative learnability results in this context.
^2 That means that the capabilities of the corresponding learners lie in-between.
Concerning the notions of inference, see [7,1,14] and the preliminaries below.




Comparison of Query Learning and Gold-Style Learning 101 



Most of our proofs provided below make use of recursion-theoretic conceptions,
thus in particular providing a second quite accessible example (after the
one in [11]) for a class identifiable by a behaviourally correct learner in Gödel
numberings but not identifiable in the limit. The interesting feature of these
classes is that they are defined without any diagonal construction, very unlike
the corresponding classes known before, see for instance [1].

Comparing our results to a result from [13] points out a related open prob- 
lem in Gold-style learning: Note that, by [13], an indexable class is learnable 
conservatively using an oracle for the halting problem and a uniformly recursive 
hypothesis space if and only if it is learnable in the limit. In contrast to that, 
we show that conservative learners using an oracle for the halting problem and
a Gödel numbering as a hypothesis space are more capable than limit learners.
This implies that, in the context of conservative inference, oracle learners may
benefit from using a Gödel numbering instead of a uniformly recursive numbering
as a hypothesis space. Now the related open problem is: do conservative learners
(without the help of oracles) also benefit from Gödel numberings instead of
uniformly recursive numberings? Though this question is quite natural, it has not
been discussed in the literature so far. Unfortunately, we can provide an answer
only for a special case: if a learner is required to work both conservatively and
consistently on the relevant data, Gödel numberings do not increase the
capabilities when compared to uniformly recursive hypothesis spaces. Additional results
below relate several further natural inference types in both models to each other.

2 Preliminaries 

Familiarity with standard recursion theoretic and language theoretic notions is
assumed, see [12,8]. Subsequently, let $\Sigma$ be a finite alphabet with
$\{a, b\} \subseteq \Sigma$. A word is any element from $\Sigma^*$ and a language any subset
of $\Sigma^*$. The complement $\overline{L}$ of a language $L$ is the set
$\Sigma^* \setminus L$. Any infinite sequence $t = (w_i)_{i \in \mathbb{N}}$ with
$\{w_i \mid i \in \mathbb{N}\} = L$ is called a text for $L$. Then, for any
$n \in \mathbb{N}$, $t_n$ denotes the initial segment $(w_0, \ldots, w_n)$ and
$\mathrm{content}(t_n)$ denotes the set $\{w_0, \ldots, w_n\}$.
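
As a small illustration of these notions, the following Python sketch builds one possible text for a finite language and computes an initial segment $t_n$ and its content. The function names and the choice of a cyclic text are assumptions made for this example only; any infinite sequence enumerating exactly the language would do.

```python
from itertools import count, islice
from typing import Iterator, List, Set


def text_for(language: List[str]) -> Iterator[str]:
    """One possible text for a finite, non-empty language: cycle through its words forever."""
    for n in count():
        yield language[n % len(language)]


def initial_segment(t: Iterator[str], n: int) -> List[str]:
    """t_n = (w_0, ..., w_n): the first n + 1 words of the text."""
    return list(islice(t, n + 1))


def content(segment: List[str]) -> Set[str]:
    """content(t_n) = {w_0, ..., w_n}: the set of words occurring in the segment."""
    return set(segment)
```

For the language {b, ab, aab}, the segment $t_4$ of this particular text already has the whole language as its content, although a learner reading the text cannot know that in general.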

A family $(A_i)_{i \in \mathbb{N}}$ of languages is uniformly recursive (uniformly r. e.)
if there is a recursive (partial recursive) function $f$ with
$A_i = \{w \in \Sigma^* \mid f(i, w) = 1\}$ for all $i \in \mathbb{N}$. A family
$(A_i)_{i \in \mathbb{N}}$ is uniformly 2-r. e., if there is a recursive function $g$ with
$A_i = \{w \in \Sigma^* \mid g(i, w, n) = 1 \text{ for all but finitely many } n\}$ for all
$i \in \mathbb{N}$. Note that for uniformly recursive families membership is uniformly
decidable.

Let $C$ be a class of recursive languages. $C$ is said to be an indexable class
of recursive languages (in the sequel we will write indexable class for short), if
there is a uniformly recursive family $(L_i)_{i \in \mathbb{N}}$ of all and only the
languages in $C$. Such a family will subsequently be called an indexing of $C$.

A family $(T_i)_{i \in \mathbb{N}}$ of finite languages is recursively generable, if there
is a recursive function that, given $i \in \mathbb{N}$, enumerates all elements of $T_i$ and
stops.

In the sequel, let $\varphi$ be a Gödel numbering of all partial recursive functions and
$\Phi$ the associated Blum complexity measure, see [5] for a definition. For
$i, n \in \mathbb{N}$ we will write $\varphi_i[n]$ for the initial segment
$(\varphi_i(0), \ldots, \varphi_i(n))$ and say that $\varphi_i[n]$ is defined if all the
values $\varphi_i(0), \ldots, \varphi_i(n)$ are defined. For convenience, $\varphi_i[-1]$ is
always considered defined. Moreover, let
$Tot = \{i \in \mathbb{N} \mid \varphi_i \text{ is a total function}\}$ and
$K = \{i \in \mathbb{N} \mid \varphi_i(i) \text{ is defined}\}$. The family
$(W_i)_{i \in \mathbb{N}}$ of languages is given by
$W_i = \{\omega_j \mid \varphi_i(j) \text{ is defined}\}$ for all $i \in \mathbb{N}$, where
$(\omega_j)_{j \in \mathbb{N}}$ is some fixed effective enumeration of $\Sigma^*$ without
repetitions. Moreover, we use a bijective recursive function coding a pair $(x, y)$ with
$x, y \in \mathbb{N}$ into a number $\langle x, y \rangle \in \mathbb{N}$.

2.1 Language Learning via Queries 

In the query learning model, a learner has access to a teacher that truthfully 
answers queries of a specified kind. A query learner M is an algorithmic device 
that, depending on the reply on the previous queries, either computes a new 
query or returns a hypothesis and halts, see [2]. Its queries and hypotheses are 
coded as natural numbers; both will be interpreted with respect to an underlying 
hypothesis space. When learning an indexable class $C$, any indexing
$\mathcal{H} = (L_i)_{i \in \mathbb{N}}$
of $C$ may form a hypothesis space. So, as in the original definition, see [2], when
learning C, M is only allowed to query languages belonging to C. 

More formally, let $C$ be an indexable class, let $L \in C$, let
$\mathcal{H} = (L_i)_{i \in \mathbb{N}}$ be an indexing of $C$, and let $M$ be a query
learner. $M$ learns $L$ with respect to $\mathcal{H}$ using some type of queries if it
eventually halts and its only hypothesis, say $i$, correctly describes $L$, i.e.,
$L_i = L$. So $M$ returns its unique and correct guess $i$ after only finitely many
queries. Moreover, $M$ learns $C$ with respect to $\mathcal{H}$ using some type of queries,
if it learns every $L' \in C$ with respect to $\mathcal{H}$ using queries of the specified
type. In order to learn a language $L$, a query learner $M$ may ask:

Membership queries. The input is a string w and the answer is ‘yes’ or ‘no’, 
depending on whether or not w belongs to L. 

Restricted superset queries. The input is an index of a language $L' \in C$. The
answer is ‘yes’ or ‘no’, depending on whether or not $L'$ is a superset of $L$.

Restricted disjointness queries. The input is an index of a language $L' \in C$. The
answer is ‘yes’ or ‘no’, depending on whether or not $L'$ and $L$ are disjoint.
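
The three query types can be illustrated with a toy teacher in Python. The sketch below assumes a finite indexing of finite languages so that every answer is computable; in the general setting of this paper, superset and disjointness queries need not be decidable. The class name `Teacher` and its methods are hypothetical names chosen for this example.

```python
from typing import FrozenSet, List

Language = FrozenSet[str]


class Teacher:
    """Truthful teacher for a target language taken from a finite toy indexing."""

    def __init__(self, indexing: List[Language], target: Language):
        self.indexing = indexing
        self.target = target

    def membership(self, w: str) -> bool:
        # 'yes' iff the word w belongs to the target language L.
        return w in self.target

    def superset(self, i: int) -> bool:
        # 'yes' iff the queried language L_i is a superset of L.
        return self.indexing[i] >= self.target

    def disjointness(self, i: int) -> bool:
        # 'yes' iff the queried language L_i and L are disjoint.
        return self.indexing[i].isdisjoint(self.target)
```

A query learner interacts with such a teacher only through these answers and must halt with a single hypothesis.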

MemQ, rSupQ, and rDisQ denote the collections of all indexable classes $C$
for which there are a query learner $M'$ and a hypothesis space $\mathcal{H}'$ such that
$M'$ learns $C$ with respect to $\mathcal{H}'$ using membership, restricted superset, and
restricted disjointness queries, respectively. In the sequel we will omit the term
‘restricted’ for convenience and also neglect other types of queries analysed in the
literature, see [2,3]. Obviously, superset and disjointness queries are in general not
decidable, i.e., the teacher may be non-computable.

Note that learning via queries focusses on the aspect of one-shot learning, i.e., it
is concerned with scenarios in which learning eventuates without mind changes. 

Having a closer look at the different models of query learning, one easily 
finds negative learnability results. Some examples in [11] point to a drawback 
of Angluin’s query model, namely the demand that a query learner is restricted 
to pose queries concerning languages contained in the class of possible target 
languages. That means there are very simple classes of languages, for which any 
learner must fail just because it is barred from asking the ‘appropriate’ queries. 
To overcome this drawback, it seems reasonable to allow the query learner to
formulate its queries with respect to any uniformly recursive family comprising
the target class $C$. An extra query learner (see also [10,11]) for an indexable class
$C$ is permitted to query languages in any uniformly recursive family
$(L'_i)_{i \in \mathbb{N}}$ comprising $C$. We say that $C$ is learnable with extra superset
(disjointness) queries respecting $(L'_i)_{i \in \mathbb{N}}$ iff there is an extra query
learner $M$ learning $C$ with respect to $(L'_i)_{i \in \mathbb{N}}$ using superset
(disjointness) queries concerning $(L'_i)_{i \in \mathbb{N}}$. Then $rSupQ_{rec}$
($rDisQ_{rec}$) denotes the collection of all indexable classes $C$ learnable
with extra superset (disjointness) queries respecting a uniformly recursive family.

It is conceivable to permit even more general hypothesis spaces, i.e., to
demand a more potent teacher. Thus, let $rSupQ_{r.e.}$ ($rDisQ_{r.e.}$) denote the
collection of all indexable classes which are learnable with superset (disjointness)
queries respecting a uniformly r. e. family. Obviously, each class in $rSupQ_{r.e.}$
($rDisQ_{r.e.}$) can be identified respecting our fixed numbering
$(W_i)_{i \in \mathbb{N}}$. Similarly, replacing the subscript ‘r. e.’ by ‘2-r. e.’, we
consider learning in a uniformly 2-r. e. family.

Note that the capabilities of rSupQ-learners (rDisQ-learners) already
increase with the additional permission to ask membership queries. Yet, as has
been shown in [11], combining superset or disjointness queries with membership
queries does not yield the same capability as extra queries do. For convenience,
we denote the family of classes which are learnable with a combination of
superset (disjointness) and membership queries by rSupMemQ (rDisMemQ).

2.2 Gold-Style Language Learning 

Let $C$ be an indexable class, $\mathcal{H} = (L_i)_{i \in \mathbb{N}}$ any uniformly
recursive family (called hypothesis space), and $L \in C$. An inductive inference machine
(IIM) $M$ is an algorithmic device that reads longer and longer initial segments of a text
and outputs numbers $M(t_n)$ as its hypotheses. An IIM $M$ returning some $i$ is
construed to hypothesize the language $L_i$. Given a text $t$ for $L$, $M$ identifies $L$
from $t$ with respect to $\mathcal{H}$ in the limit, if the sequence of hypotheses output
by $M$, when fed $t$, stabilizes on a number $i$ (i.e., past some point $M$ always outputs
the hypothesis $i$) with $L_i = L$. $M$ identifies $C$ in the limit from text with respect
to $\mathcal{H}$, if it identifies every $L' \in C$ from every corresponding text.
$LimTxt_{rec}$ denotes the collection of all indexable classes $C$ for which there are an
IIM $M'$ and a uniformly recursive family $\mathcal{H}'$ such that $M'$ identifies $C$ in
the limit from text with respect to $\mathcal{H}'$. A quite natural and often studied
modification of $LimTxt_{rec}$ is defined by the model of conservative inference, see [1].
$M$ is a conservative IIM for $C$ with respect to $\mathcal{H}$, if $M$ performs only
justified mind changes, i.e., if $M$, on some text $t$ for some $L \in C$, outputs
hypotheses $i$ and later $j$, then $M$ must have seen some element $w \notin L_i$ before
returning $j$. An important property of conservative learners is that they never
hypothesize proper supersets of the language currently to be learned. The collection of
all indexable classes identifiable from text by a conservative IIM is denoted by
$ConsvTxt_{rec}$. Note that $ConsvTxt_{rec} \subset LimTxt_{rec}$ [14]. Since we consider
learning from text only, we will assume in the sequel that all languages to be learned are
non-empty. One main aspect of human learning modelled in the approach of learning in the
limit is the ability to change one’s mind during learning. Thus learning is a process in
which the learner may change its hypothesis arbitrarily often before stabilizing
on its final correct guess. In particular, it is undecidable whether or not the
final hypothesis has been reached, i.e., whether or not a success in learning has
already eventuated.
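
To make identification in the limit and conservativeness concrete, here is a minimal Python sketch for a toy indexable class that is not taken from the paper: the family $L_i = \{a^z b \mid z \leq i\}$ is an assumption chosen for illustration. The IIM outputs the smallest index whose language contains everything seen so far, so it changes its mind only after seeing a word outside its current hypothesis.

```python
from typing import List


def member(i: int, w: str) -> bool:
    """Uniformly decidable membership for the toy family L_i = {a^z b | z <= i}."""
    return w.endswith("b") and set(w[:-1]) <= {"a"} and len(w) - 1 <= i


def iim(segment: List[str]) -> int:
    """Hypothesis on t_n: the smallest index i with content(t_n) contained in L_i."""
    return max(len(w) - 1 for w in segment)


def hypotheses(text_prefix: List[str]) -> List[int]:
    """The sequence M(t_0), M(t_1), ... on a finite prefix of a text."""
    return [iim(text_prefix[: n + 1]) for n in range(len(text_prefix))]
```

On any text for $L_i$ the hypotheses stabilize on $i$ once the word $a^i b$ has appeared, but no finite prefix tells the learner that this point has been reached.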

If only uniformly recursive families are used as hypothesis spaces, $LimTxt_{rec}$
coincides with the collection of indexable classes identifiable in a behaviourally
correct manner, see [6]: If $C$ is an indexable class, $\mathcal{H} = (L_i)_{i \in \mathbb{N}}$
a uniformly recursive family, $M$ an IIM, then $M$ is a behaviourally correct learner for
$C$ from text with respect to $\mathcal{H}$, if for each $L \in C$ and each text $t$ for
$L$, all but finitely many outputs $i$ of $M$ on $t$ fulfil $L_i = L$. Here $M$ may
alternate different correct hypotheses instead of converging to a single hypothesis.
Defining $BcTxt_{rec}$ correspondingly as usual yields $BcTxt_{rec} = LimTxt_{rec}$ (a
folklore result).

This relation no longer holds, if more general types of hypothesis spaces are
considered. Assume $C$ is an indexable class and $\mathcal{H}^+$ is any uniformly
r. e. family of languages comprising $C$. Then it is also conceivable to use
$\mathcal{H}^+$ as a hypothesis space. For $I \in \{Lim, Consv, Bc\}$, $ITxt_{r.e.}$ denotes
the collection of all indexable classes learnable as in the definition of $ITxt_{rec}$, if
the demand for a uniformly recursive family $\mathcal{H}$ as a hypothesis space is loosened
to demanding a uniformly r. e. family $\mathcal{H}^+$ as a hypothesis space. Note that each
class in $ITxt_{r.e.}$ can also be $ITxt$-identified in the hypothesis space
$(W_i)_{i \in \mathbb{N}}$. Interestingly, $LimTxt_{rec} = LimTxt_{r.e.}$ (a folklore
result), i.e., in learning in the limit, the capabilities of IIMs do not increase, if the
constraints concerning the hypothesis space are weakened by allowing for arbitrary
uniformly r. e. families. In contrast to that, for BcTxt-identification, weakening these
constraints yields an add-on in learning power, i.e., $BcTxt_{rec} \subset BcTxt_{r.e.}$.
In particular, $LimTxt_{rec} \subset BcTxt_{r.e.}$ and so LimTxt- and BcTxt-learning no
longer coincide for identification with respect to arbitrary uniformly r. e. families, see
also [4,1].

The main results of our analysis will be comparisons of these inference types 
with different query learning types. For that purpose we will make use of well- 
known characterizations based on so-called families of telltales, see [1]. 

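Definition 1. Let $(L_i)_{i \in \mathbb{N}}$ be a uniformly recursive family and
$(T_i)_{i \in \mathbb{N}}$ a family of finite non-empty sets. $(T_i)_{i \in \mathbb{N}}$ is a
telltale family for $(L_i)_{i \in \mathbb{N}}$ iff for all $i, j \in \mathbb{N}$:

1. $T_i \subseteq L_i$.

2. If $T_i \subseteq L_j \subseteq L_i$, then $L_j = L_i$.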
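
For finite families of finite languages, the two telltale conditions can be checked directly. The following Python sketch is only an illustration (the function name and the restriction to finite families are assumptions of this example); it tests both conditions for every pair of indices.

```python
from typing import FrozenSet, Sequence


def is_telltale_family(T: Sequence[FrozenSet[str]], L: Sequence[FrozenSet[str]]) -> bool:
    """Check Definition 1 for a finite family L and candidate telltales T."""
    for i in range(len(L)):
        # T_i must be a non-empty subset of L_i (condition 1).
        if not T[i] or not T[i] <= L[i]:
            return False
        for j in range(len(L)):
            # Condition 2: no L_j may lie strictly between T_i and L_i.
            if T[i] <= L[j] <= L[i] and L[j] != L[i]:
                return False
    return True
```

For a chain of finite languages, each language is a telltale for itself, whereas one common small set fails as soon as it is contained in a proper subset of some $L_i$ that belongs to the family.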

Telltale families are the best known concept to illustrate the specific
differences between indexable classes in $LimTxt_{rec}$, $ConsvTxt_{rec}$, and
$BcTxt_{r.e.}$. Their algorithmic structure has turned out to be crucial for learning, see
[1,9,4]:

Theorem 1. Let $C$ be an indexable class of languages.

1. $C \in LimTxt_{rec}$ iff there is an indexing of $C$ possessing a uniformly r. e. family
of telltales.

2. $C \in ConsvTxt_{rec}$ iff there is a uniformly recursive family comprising $C$ and
possessing a recursively generable family of telltales.

3. $C \in BcTxt_{r.e.}$ iff there is an indexing of $C$ possessing a family of telltales.

3 Hypothesis Spaces in Query Learning

Concerning the influence of the query and hypothesis spaces in query learning, 
various interesting results have been established in [11]. These reveal a hierar- 
chy of capabilities of query learners resulting from a growing generality of the 
hypothesis spaces. Interestingly, in some but not in all cases, the capabilities of 
superset query learners and disjointness query learners coincide: 

Theorem 2. [11]

$rSupQ_{rec} = rDisQ_{rec} \subset rDisQ_{r.e.} \subset rSupQ_{r.e.} \subseteq rSupQ_{2\text{-}r.e.} = rDisQ_{2\text{-}r.e.}$.

In [11] it has remained open, whether or not there is an indexable class
in $rSupQ_{2\text{-}r.e.} \setminus rSupQ_{r.e.}$, a problem which will be solved below.
Moreover, [11] studies original superset (disjointness) query learners which are
additionally permitted to ask membership queries. Their capabilities are in-between those
of the original learners and extra query learners.

Theorem 3. [11] (a) $rSupQ \subset rSupMemQ \subset rSupQ_{rec}$.

(b) $rDisQ \subset rDisMemQ \subset rDisQ_{rec}$.

Comparing these results, notice that Theorem 2 analyses the relationship 
between superset and disjointness query learning, whereas Theorem 3 avoids 
corresponding statements. That means, it has remained open, how the inference 
types rSupQ and rSupMemQ relate to rDisQ and rDisMemQ. 

As an answer to this question, we can state that rSupQ and rSupMemQ are
incomparable to both rDisQ and rDisMemQ, an immediate consequence of the
following theorem.

Theorem 4. (a) $rSupQ \not\subseteq rDisMemQ$. (b) $rDisQ \not\subseteq rSupMemQ$.

Proof. We provide the separating indexable classes without a proof.

(a) The class containing $L = \{b\}$ and $L_k = \{a^k, b\}$ for $k > 0$ belongs to
$rSupQ \setminus rDisMemQ$.

(b) A class $C_b \in rDisQ \setminus rSupMemQ$ is defined as follows: for $k \in \mathbb{N}$, let

Cb contain the languages Lk = {a^b^ I > 0} and LJ. = | j > g}. 

Additionally, if k ^ K, then Cb contains L], j = for all j G N; whereas, 

if k G K, then Cb contains j = {a^6^ | z < or z > <Pk{k) + j} as well as 

Ll - = U for all j G N. □ 
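
The source states part (a) without proof. As a hedged illustration of why superset queries suffice for the class consisting of $L = \{b\}$ and $L_k = \{a^k, b\}$ for $k > 0$, the following Python sketch shows one possible one-shot learner. The encoding of $L$ as index 0 and the helper names are assumptions of this example, not notation from the paper.

```python
from typing import Callable, FrozenSet


def language(k: int) -> FrozenSet[str]:
    """Index 0 stands for L = {b}; index k > 0 stands for L_k = {a^k, b}."""
    return frozenset({"b"}) if k == 0 else frozenset({"a" * k, "b"})


def make_superset_teacher(target_index: int) -> Callable[[int], bool]:
    """Teacher answering 'is the queried language a superset of the target?'."""
    target = language(target_index)
    return lambda k: language(k) >= target


def superset_learner(ask: Callable[[int], bool]) -> int:
    """Query L first, then L_1, L_2, ...; return the first index answered 'yes'."""
    k = 0
    while not ask(k):
        k += 1
    return k
```

The order of the queries matters: every $L_k$ is a superset of $\{b\}$, so $L$ has to be asked first. Once the answer for $L$ is 'no', the target is some $L_k$, and exactly one query in the remaining sequence receives the answer 'yes'.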

The more challenging open question is whether or not the inference types
$rSupQ_{r.e.}$ and $rSupQ_{2\text{-}r.e.}$ coincide. Interestingly, this is not the case,
that means, 2-r.e. numberings provide a further benefit for learning with superset queries.

Theorem 5. $rSupQ_{r.e.} \subset rSupQ_{2\text{-}r.e.}$.

Though our current tools allow for a verification of this theorem, the proof
would be rather lengthy. Since a characterisation of $rSupQ_{r.e.}$ in terms of
Gold-style learning simplifies the proof considerably, we postpone the proof for now.




106 S. Lange and S. Zilles 



4 Query Learning and Gold-Style Learning — Relations 

In [11], a couple of relations between query learning and Gold-style learning have 
been elaborated. The following theorem summarizes the corresponding results. 

Theorem 6. [11] (a) $rSupQ_{rec} = rDisQ_{rec} = ConsvTxt_{rec}$.

(b) $rDisQ_{r.e.} = LimTxt_{rec} = BcTxt_{rec}$.

(c) $rSupQ_{2\text{-}r.e.} = rDisQ_{2\text{-}r.e.} = BcTxt_{r.e.}$.

By Theorems 2 and 5 this implies $LimTxt_{rec} \subset rSupQ_{r.e.} \subset BcTxt_{r.e.}$,
i.e., we have found a natural type of learners the capabilities of which are strictly
between those of LimTxt-learners and those of BcTxt-learners. This raises the
question whether the learning type $rSupQ_{r.e.}$ can also be characterised in terms
of Gold-style learning. This is indeed possible if we consider learners which have
access to some oracle. In the sequel the notion $ConsvTxt_{r.e.}[K]$ refers to the
collection of indexable classes which are learnable in the sense of $ConsvTxt_{r.e.}$,
if also $K$-recursive learners are admitted, see [13].

Theorem 7. $rSupQ_{r.e.} = ConsvTxt_{r.e.}[K]$.

Proof. First, we prove $rSupQ_{r.e.} \subseteq ConsvTxt_{r.e.}[K]$. For that purpose assume
$C$ is an indexable class in $rSupQ_{r.e.}$. Let $M$ be a query learner identifying $C$ in
$(W_i)_{i \in \mathbb{N}}$ and assume wlog that each hypothesis ever returned by $M$
corresponds to the intersection of all queries answered with ‘yes’ in the preceding
scenario.^3 Let $L \in C$, $t$ a text for $L$. Let $M'(t_0)$ be an index of the language
$\mathrm{content}(t_0)$. Given $n \geq 1$, a learner $M'$ works on input $t_n$ as follows:
$M'$ simulates $M$ for $n$ steps of computation. Whenever $M$ asks a superset query $i$,
$M'$ transmits the answer ‘yes’ to $M$, if $\mathrm{content}(t_n) \subseteq W_i$, the answer
‘no’, otherwise (* this test is $K$-recursive *). If $M$ returns a hypothesis $i$ within $n$
steps of computation, let $M'$ return $i$ on $t_n$; otherwise let $M'(t_n) = M'(t_{n-1})$.

Note that there must be some $n$, such that $M'$ answers all queries of $M$
truthfully respecting $L$. Thus it is not hard to verify that the $K$-recursive IIM $M'$
learns $L$ in the limit from text. Moreover, $W_{M'(t_n)} \not\supset L$ for all
$n \in \mathbb{N}$: assuming $W_{M'(t_n)} \supset L$ implies, by normalisation of $M$, that
all queries $M'$ has answered with ‘yes’ in the simulation of $M$ indeed represent supersets
of $L$. Since all ‘no’-answers are truthful respecting $L$ by definition, this yields a
truthful query-scenario for $L$. As $M$ learns $L$ from superset queries, the hypothesis
$i$ must correctly describe $L$, a contradiction. So $M'$ learns $C$ without ever returning
an index of a proper superset of a language currently to be identified. Now it is not
hard to modify $M'$ into a $K$-recursive IIM which works conservatively for the
class $C$ (a hypothesis will only be changed if its inconsistency is verified with the
help of a $K$-oracle). Thus $C \in ConsvTxt_{r.e.}[K]$ and
$rSupQ_{r.e.} \subseteq ConsvTxt_{r.e.}[K]$.

^3 Think of $M$ as a normalisation of a superset query learner $M^-$: $M$ copies $M^-$ until
$M^-$ returns the hypothesis $i$. Now $M$ asks a query for the language $W_i$ instead of
returning a hypothesis. Then let $M$ return a hypothesis $j$ representing the
intersection of all queries answered with ‘yes’ in its preceding scenario. Given a fair
scenario for $W_i$ and a successful learner $M^-$, this implies $W_i = W_j$ and thus $M$ is
successful.

Second, we show $ConsvTxt_{r.e.}[K] \subseteq rSupQ_{r.e.}$. For that purpose assume $C$ is
an indexable class in $ConsvTxt_{r.e.}[K]$. Let $M$ be a $K$-recursive IIM identifying $C$
with respect to $(W_i)_{i \in \mathbb{N}}$. Suppose $L \in C$ is the target language. An
rSupQ-learner $M'$ for $L$ with respect to $(W_i)_{i \in \mathbb{N}}$ is defined by steps,
starting in step 0. Note that representations in $(W_i)_{i \in \mathbb{N}}$ can be computed
for all queries to be asked.
In step 0, M' finds the minimal m, such that the query for is answered 

‘no’. M' sets t(0) = and goes to step 1. In general, step n+ 1 reads as follows: 

- Ask a superset query for If the answer is ‘no’, let t{n+l) = 

if the answer is ‘yes’, let t{n + 1) = t{n). 

(* Note that content{tn+i) = L fl {w* | a; < n -I- 1}. *) 

- Simulate $M$ on input $t_{n+1}$. Whenever $M$ wants to access a $K$-oracle for the
question whether $j \in K$, formulate a superset query for the language



W = 

3 




if is defined , 
otherwise . 



and transmit the received answer to M . 

(* Note that W'j is uniformly r.e. in j and LLj A L iff <pj{j) is defined. *) 

As soon as M returns i = M(t„+i), pose a superset query for Wi. If the answer 
is ‘yes’, then return the hypothesis i and stop (* since M learns L conservatively, 
we have L and thus Wi = L*).li the answer is ‘no’, then go to step n+2. 

Now it is not hard to verify that M' learns C with superset queries in 
Details are omitted. Thus C G rSupQ^ ^ and Consv TxC,e\K] C rSupQ^ ^ . □ 
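The text-building part of the steps above can be illustrated on a toy scale. The following is only a sketch under simplifying assumptions that are not part of the proof: the universe is a finite list of words, languages are finite non-empty Python sets, and the names `make_superset_oracle` and `build_text` are hypothetical helpers introduced here. It shows how 'no' answers to superset queries for the universe minus one word reveal membership of that word.

```python
def make_superset_oracle(target):
    """Return a truthful superset oracle for the hidden (non-empty) language `target`."""
    target = frozenset(target)

    def oracle(query_language):
        # 'yes' (True) iff the queried language is a superset of the target.
        return target <= frozenset(query_language)

    return oracle


def build_text(oracle, words, steps):
    """Build t(0), ..., t(steps) from superset queries only.

    words: the enumeration w_0, w_1, ... of the (finite, toy) universe.
    steps: number of steps n+1 to execute; must satisfy steps < len(words).
    """
    universe = frozenset(words)
    # Step 0: the minimal m whose query "universe minus {w_m}" is answered 'no',
    # which happens exactly when w_m belongs to the target.
    m = next(i for i, w in enumerate(words) if not oracle(universe - {w}))
    text = [words[m]]
    for n in range(steps):
        w = words[n + 1]
        if not oracle(universe - {w}):   # 'no': w is in the target
            text.append(w)
        else:                            # 'yes': repeat the previous element
            text.append(text[-1])
    return text


oracle = make_superset_oracle({"b", "d"})
print(build_text(oracle, ["a", "b", "c", "d", "e"], 4))
```

The K-oracle simulation and the final superset query for $W_i$ are deliberately left out, since they have no finite toy counterpart.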

Using this characterisation, Theorem 5 translates as follows:

Theorem 5'. $ConsvTxt_{r.e.}[K] \subset BcTxt_{r.e.}$.

Proof. By Theorems 6 and 7 it suffices to prove $BcTxt_{r.e.} \setminus ConsvTxt_{r.e.}[K] \neq \emptyset$.
For that purpose we provide an indexable class $C_{bc} \in BcTxt_{r.e.} \setminus ConsvTxt_{r.e.}[K]$.
For all $k \in \mathbb{N}$, $C_{bc}$ contains the language $L_k = \{a^k b^z \mid z \ge 0\}$. Moreover, for all
$k, i, j \in \mathbb{N}$ for which $\varphi_k[i-1]$ is defined and $j \le i$, let $C_{bc}$ contain the language

\[
L_{k,i,j} = \begin{cases} \{a^k b^z \mid z \le j\}, & \text{if } \varphi_k(i) \text{ is undefined},\\ \ldots, & \text{if } \varphi_k(i) \text{ is defined}. \end{cases}
\]



To show that $C_{bc} \in BcTxt_{r.e.}$, it suffices by Theorem 1 to prove the existence
of telltales corresponding to some indexing of $C_{bc}$. This is quite simple: as each
language $L_{k,i,j} \in C_{bc}$ is finite, it forms a telltale for itself. Moreover, as for all $k$
there are only finitely many subsets of $L_k$ in $C_{bc}$, telltales for $L_k$ must exist, too.

Finally, it remains to prove that $C_{bc} \notin ConsvTxt_{r.e.}[K]$. Assume the opposite,
i.e., there is some $K$-recursive IIM $M$ which ConsvTxt-identifies $C_{bc}$ in $(W_i)_{i\in\mathbb{N}}$.
The idea is to deduce a contradiction by concluding that Tot is $K$-recursive. For
that purpose, define a $K$-recursive procedure on input $k$ as follows:

- Let $t = a^k b^0, a^k b^1, a^k b^2, \ldots$ be the 'canonical' text for $L_k$.

- Simulate $M$ on input $t_0, t_1, t_2, \ldots$ until some $n$ is found with
$content(t_n) \subset W_{M(t_n)} \subseteq L_k$. (* $n$ exists, as $M$ learns $L_k$. Determining $n$ is $K$-recursive. *)

- If $\varphi_k(i)$ is defined for all $i \le n$, then return '1'; otherwise return '0'.

Obviously, this procedure is $K$-recursive. Note that it returns '0' only in
case $\varphi_k$ is not total. So assume it returns '1'. Then there is some $n$ such that
$content(t_n) \subset W_{M(t_n)} \subseteq L_k$. If $\varphi_k$ was not total, the minimal $i$ for which $\varphi_k(i)$
is undefined would be greater than $n$. Thus $L = \{a^k b^z \mid z \le n\} \in C_{bc}$. Now $t_n$ is
also a text segment for $L$, but $L = content(t_n) \subset W_{M(t_n)}$. Thus $M$ hypothesizes
a proper superset of $L$ on input $t_n$ and hence $M$ fails to learn $L$ conservatively.
This contradicts the choice of $M$, so $\varphi_k$ is total.

Consequently, our procedure decides Tot, i.e., Tot is $K$-recursive. As this is
impossible, we have $C_{bc} \notin ConsvTxt_{r.e.}[K]$. □

In particular, the class $C_{bc}$ defined in this proof constitutes a second quite
accessible example (after the one in [11]) of a class identifiable by a behaviourally
correct learner in Gödel numberings but not identifiable in the limit. An interesting
feature is that these classes are defined without any diagonal construction,
very unlike the corresponding classes known before, see for instance [1].

Thus Theorem 5 is proven, too. This is an example of the advantages
of our characterisations; verifying Theorem 5 without Theorems 6 and 7 would 
have been possible, but more complicated. So features of Gold-style learning 
can be exploited in the context of query learning. Note that, with Theorems 6 
and 7, we have characterised all types of extra query learning in terms of Gold- 
style learning. In order to better describe and understand the capabilities of the 
original query learners (the types rSupQ, rSupMemQ, rDisQ, rDisMemQ), let 
us have a closer look at the results established up to now. 

We know that $rSupQ_{rec} = rDisQ_{rec} = ConsvTxt_{rec}$. Note that, according
to [9], for successful conservative learning, it is decisive whether or not the
learner is allowed to hypothesize languages not belonging to the target class.
That means, if we denote by PConsvTxt the family of all indexable classes $C$
for which there is an indexing $(L_i)_{i\in\mathbb{N}}$ exactly describing $C$ and an IIM
ConsvTxt-learning $C$ in $(L_i)_{i\in\mathbb{N}}$, then we obtain $PConsvTxt \subset ConsvTxt_{rec}$.

Similarly, the decisive difference between extra query learners and the original 
query learners is the ability to pose queries representing languages not belonging 
to the target class. 

This raises the question how the original types of query learning can be
compared to class-preserving conservative inference. As it turns out, PConsvTxt-learners
have much in common with rSupMemQ-learners. In contrast, significant
discrepancies between PConsvTxt and the query learning types rSupQ, rDisQ,
and rDisMemQ can be observed.

Theorem 8. (a) $PConsvTxt \subset rSupMemQ$.

(b) $T \mathrel{\#} PConsvTxt$ for all $T \in \{rSupQ, rDisQ, rDisMemQ\}$.

Proof. (a) The proof of $PConsvTxt \subseteq rSupMemQ$ results from a slight modification
of the proof of $ConsvTxt_{rec} \subseteq rSupQ$ in [10]:

Fix $C \in PConsvTxt$. Then there is an indexing $(L_i)_{i\in\mathbb{N}}$ of $C$ and an IIM $M$,
such that $M$ is a PConsvTxt-learner for $C$ in $(L_i)_{i\in\mathbb{N}}$. Note that, as in the general
case of conservative inference, if $L \in C$ and $t$ is a text for $L$, then $M$ never returns
an index $i$ with $L \subset L_i$ on any initial segment of $t$.

An rSupMemQ-learner $M'$ identifying any $L \in C$ may use membership
queries to construct a text for $L$ and then simulate $M$ on this text until $M$
returns an index of a superset of $L$. This index is then returned by $M'$:

First, to effectively enumerate a text $t$ for $L$, $M'$ determines the set $T$ of all
words in $\Sigma^*$ for which a membership query is answered with 'yes'. Any recursive
enumeration of $T$ yields a text for $L$.

Second, to compute its hypothesis, $M'$ executes steps 0, 1, 2, ... until it
receives a stop signal. In general, step $n$, $n \in \mathbb{N}$, reads as follows:

- Determine $i := M(t_n)$, where $t$ is a recursive enumeration of the set $T$. Pose
a superset query referring to $L_i$. If the answer is 'yes', hypothesize the language $L_i$ and
stop. (* As $M$ never hypothesizes a proper superset of $L$, $L_i$ equals $L$. *) If the
answer is 'no', then go to step $n+1$.

It is not hard to verify that $M'$ is a successful rSupMemQ-learner for $C$.
Further details are omitted. So $PConsvTxt \subseteq rSupMemQ$.
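The two phases of $M'$ can be mimicked on a finite toy class. The sketch below is illustrative only and rests on assumptions not made in the proof: languages are finite Python sets over a finite universe, the conservative learner is a hand-written function, and the name `learn_with_sup_and_mem_queries` is hypothetical.

```python
def learn_with_sup_and_mem_queries(indexing, learner, member, superset,
                                   universe, max_steps=1000):
    """Toy version of the rSupMemQ-learner M'.

    indexing: list of languages L_0, L_1, ... (finite sets here)
    learner:  class-preserving conservative learner, maps a text prefix to an index
    member:   membership oracle for the hidden target, word -> bool
    superset: superset oracle for the hidden target, language -> bool
    """
    # Phase 1: membership queries determine the set T; listing its elements
    # in the fixed order of `universe` gives a text for the target.
    members = [w for w in universe if member(w)]
    # Phase 2: step n feeds the prefix t_n to the learner and asks one
    # superset query about the learner's current guess.
    for n in range(max_steps):
        prefix = members[: n + 1]
        i = learner(prefix)
        if superset(indexing[i]):
            return i          # the guess is a superset, hence equals the target
    raise RuntimeError("no hypothesis confirmed within max_steps")


# Toy class: L_i = {0, ..., i}; the learner guesses the largest word seen,
# so it never outputs a proper superset of the target.
indexing = [frozenset(range(i + 1)) for i in range(5)]
target = indexing[3]
index = learn_with_sup_and_mem_queries(
    indexing,
    learner=lambda prefix: max(prefix),
    member=lambda w: w in target,
    superset=lambda lang: target <= lang,
    universe=range(10),
)
print(index)
```

In this toy run the superset query is answered 'no' for the guesses 0, 1 and 2 and 'yes' for the guess 3, mirroring the stop condition of step $n$.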

To prove $rSupMemQ \setminus PConsvTxt \neq \emptyset$, we provide a separating class $C_{sup}$: for
all $k \in \mathbb{N}$, let $C_{sup}$ contain the language $L_k = \{a^k b^z \mid z \ge 0\}$ and, if $k \in K$,
additionally the languages $L^1_{k,j} = \{a^k b^z \mid z \le \Phi_k(k) \text{ or } [z > \Phi_k(k)+j \text{ and } z \text{ is odd}]\}$
and $L^2_{k,j} = \{a^k b^z \mid z \le \Phi_k(k) \text{ or } [z > \Phi_k(k)+j \text{ and } z \text{ is even}]\}$.

Using the indexing $(L'_{k,j})_{k,j\in\mathbb{N}}$, given by $L'_{k,0} = L_k$, $L'_{k,2j+y} = L^y_{k,j}$ if
$k \in K$, and $L'_{k,2j+y} = L_k$ if $k \notin K$ (for $y \in \{1,2\}$), one easily verifies $C_{sup} \in rSupQ$ and thus
$C_{sup} \in rSupMemQ$. In contrast to that, $C_{sup} \notin PConsvTxt$, because otherwise
$K$ would be recursive. Details are omitted.

(b) $rDisQ \setminus PConsvTxt \neq \emptyset$ and $rDisMemQ \setminus PConsvTxt \neq \emptyset$ follow from
Theorems 8(a), 3(b), and 4(b). For the other claims we just provide the separating
classes: $PConsvTxt \setminus rDisMemQ \neq \emptyset$ is witnessed by the class $C_a$ from the
proof of Theorem 4(a). The class $C_{sup}$ (see above) belongs to $rSupQ \setminus PConsvTxt$,
whereas the class consisting of the language $L = \{a\}^* \cup \{b\}$ and all the languages
$L_k = \{a, \ldots, a^k\}$, $k > 0$, belongs to $PConsvTxt \setminus rSupQ$. □

5 Discussion 

New relations have been established between learning via queries and Gold-style
language learning, depending on the hypothesis space. In particular, learning
with superset queries in uniformly r.e. numberings has revealed a natural inference
type in-between $LimTxt_{r.e.}$ and $BcTxt_{r.e.}$. In correspondence to other
characterisations this inference type has an analogue in Gold-style learning.

As we have seen, the learning capabilities of query learners depend on the
choice of the query and hypothesis space. A similar phenomenon may be observed
also in the context of Gold-style language learning, where for instance
$BcTxt_{r.e.} \supset BcTxt_{rec} = LimTxt_{rec}$. In contrast to that, for some models of
Gold-style learning, the choice of the hypothesis space is ineffectual: Recall that
$LimTxt_{r.e.} = LimTxt_{rec}$. Now assume $(A_i)_{i\in\mathbb{N}}$ is any family (not necessarily
uniformly r.e.) and $C$ is any indexable class of recursive languages. It is not hard
to prove that

- If $C$ is LimTxt-learnable wrt $(A_i)_{i\in\mathbb{N}}$, then $C \in LimTxt_{rec}$.

- If $C$ is BcTxt-learnable wrt $(A_i)_{i\in\mathbb{N}}$, then $C \in BcTxt_{r.e.}$.

But to what extent is the choice of the hypothesis space relevant in conservative
learning in the limit? Whereas each class $C \in LimTxt_{rec}$ can even be identified
with respect to a uniformly recursive indexing exactly enumerating $C$ (a folklore
result), class-preserving conservative learners may be poor compared to unrestricted
$ConsvTxt_{rec}$-learners, i.e., $PConsvTxt \subset ConsvTxt_{rec}$, see [9]. So it
remains to analyse the relevance of uniformly r.e. hypothesis spaces in the context
of conservative learning. It turns out that uniformly r.e. numberings are not
sufficient for conservative IIMs to achieve the capabilities of LimTxt-learners.

Theorem 9. $ConsvTxt_{r.e.} \subset LimTxt_{rec}$.

Proof. $ConsvTxt_{r.e.} \subseteq LimTxt_{rec}$ follows from $LimTxt_{r.e.} \subseteq LimTxt_{rec}$.

$LimTxt_{rec} \setminus ConsvTxt_{r.e.} \neq \emptyset$ is witnessed by an indexable class $C$ from [9]:
For each $k$, $C$ contains the language $L_k = \{a^k b^z \mid z \ge 0\}$ and, if $k \in K$,
additionally the languages $L_{k,j} = \{a^k b^z \mid z \le j\}$ for all $j \le \Phi_k(k)$. [9] shows
that $C \in LimTxt_{rec} \setminus ConsvTxt_{rec}$. Adapting the corresponding proof one can
verify $C \notin ConsvTxt_{r.e.}$ and thus $ConsvTxt_{r.e.} \subset LimTxt_{rec}$. □

Whether or not each class in $ConsvTxt_{r.e.}$ can also be identified conservatively
in some uniformly recursive numbering remains unanswered. Interestingly,
if a class $C \in ConsvTxt_{r.e.}$ can be identified by a learner which is conservative
and consistent for $C$, then $C \in ConsvTxt_{rec}$ (see Theorem 10). If $(A_i)_{i\in\mathbb{N}}$ is any
family of languages, $M$ an IIM, and $C$ some indexable class, then we say that $M$
learns $C$ consistently in $(A_i)_{i\in\mathbb{N}}$, if $content(t_n) \subseteq A_{M(t_n)}$ for all text segments
$t_n$ of languages in $C$. For convenience, we denote by $Cons\text{-}ConsvTxt_{r.e.}$ the family
of all indexable classes $C$ for which there is a uniformly r.e. family $(A_i)_{i\in\mathbb{N}}$ and
an IIM $M$, such that $M$ learns $C$ both consistently and conservatively in $(A_i)_{i\in\mathbb{N}}$.

The proof of Theorem 10 will make use of the fact that $ConsvTxt_{rec} =
Cons\text{-}ConsvTxt_{rec}$ according to [9], where $Cons\text{-}ConsvTxt_{rec}$ is defined as usual.

Theorem 10. $ConsvTxt_{rec} = Cons\text{-}ConsvTxt_{r.e.}$.

Proof. By $ConsvTxt_{rec} \subseteq Cons\text{-}ConsvTxt_{rec} \subseteq Cons\text{-}ConsvTxt_{r.e.}$ it remains
to prove $Cons\text{-}ConsvTxt_{r.e.} \subseteq ConsvTxt_{rec}$. For that purpose suppose $C$ is an
indexable class in $Cons\text{-}ConsvTxt_{r.e.}$. If $C$ is finite, then $C$ trivially belongs to
$ConsvTxt_{rec}$. So suppose $C$ is an infinite class.

By definition, there is an IIM $M$ which learns $C$ consistently and conservatively
in the limit in $(W_i)_{i\in\mathbb{N}}$. Moreover, let $(L_i)_{i\in\mathbb{N}}$ be an indexing for $C$.

Given $k \in \mathbb{N}$, define the canonical text $t^k$ for $L_k$ as follows: $t^k(0) = w_m$,
where $m = \min\{x \mid w_x \in L_k\}$. For $n > 0$ let $t^k(n) = w_{m+n}$ if $w_{m+n} \in L_k$, and
$t^k(n) = t^k(n-1)$ if $w_{m+n} \notin L_k$. Now let $(k_i, n_i)_{i\in\mathbb{N}}$ be an effective enumeration
of all pairs $(k,n)$ of indices such that $n = 0$ or $M(t^k_n) \neq M(t^k_{n-1})$.

The aim is to define an indexing $(L'_i)_{i\in\mathbb{N}}$ comprising $C$ and a recursively
generable family $(T'_i)_{i\in\mathbb{N}}$ of telltales for $(L'_i)_{i\in\mathbb{N}}$. Using Theorem 1 this implies
$C \in ConsvTxt_{rec}$. For that purpose we will define several auxiliary indexings.

Define an indexing $(A'_i)_{i\in\mathbb{N}}$ as follows: for $i \in \mathbb{N}$ let

\[
A'_i = \begin{cases} L_{k_i}, & \text{if } M(t^{k_i}_{n_i+x}) = M(t^{k_i}_{n_i}) \text{ for all } x > 0,\\ content(t^{k_i}_{n_i+x-1}), & \text{if } x \text{ is minimal with } M(t^{k_i}_{n_i+x}) \neq M(t^{k_i}_{n_i}). \end{cases}
\]



Claim 1. For each $L \in C$ there is some $i \in \mathbb{N}$ with $A'_i = L_{k_i} = L$.

The proof of Claim 1 is omitted.

Fix an indexing $(A_i)_{i\in\mathbb{N}}$ enumerating all the languages $A'_j$ without repetitions.
This is possible, since, by Claim 1, $(A'_i)_{i\in\mathbb{N}}$ comprises the infinite class $C$.
Accordingly, let $(y_i, z_i)_{i\in\mathbb{N}}$ be an effective enumeration with $t^{y_i}_{z_i} = t^{k_j}_{n_j}$ if $A_i = A'_j$.

Claim 2. For each $L \in C$ there is some $i \in \mathbb{N}$ with $A_i = L_{y_i} = L$.

Claim 3. Let $i \in \mathbb{N}$. Then $A_i \subseteq W_h$ for $h = M(t^{y_i}_{z_i})$.

Claim 4. Let $i \in \mathbb{N}$. $A_i = L_{y_i}$ iff $W_h = L_{y_i}$ for $h = M(t^{y_i}_{z_i})$.

The proof of Claims 2-4 is left to the reader. Now we define indexings $(B_i)_{i\in\mathbb{N}}$ and
$(T_i)_{i\in\mathbb{N}}$: for $i \in \mathbb{N}$, construct $B_i$ and $T_i$ according to the following procedure:

(* The construction will yield $B_i = A_i$ or $B_i = \emptyset$. Finally, it will be uniformly
decidable whether or not $B_i = \emptyset$. *)

- If $t^{y_j}_{z_j} \neq t^{y_i}_{z_i}$ for all $j < i$, then let $B_i = A_i$ and $T_i = content(t^{y_i}_{z_i})$.

- If some $j < i$ fulfils $t^{y_j}_{z_j} = t^{y_i}_{z_i}$, then act according to the following instructions:

- Let $h = M(t^{y_i}_{z_i})$ and $\{j_1, \ldots, j_s\} = \{j < i \mid t^{y_j}_{z_j} = t^{y_i}_{z_i}\}$.

- Execute Searches (a) and (b) until one of them terminates:

(a) Search for some $x$ with $M(t^{y_i}_{z_i+x}) \neq h$.

(b) Search for $x_{j_1}, \ldots, x_{j_s}$ with $M(t^{y_j}_{z_j+x_j}) \neq h$ for all $j \in \{j_1, \ldots, j_s\}$.

(* Note that, for all $j \in \{j_1, \ldots, j_s\}$, there must be some $x$ with $M(t^{y_i}_{z_i+x}) \neq h$
or $M(t^{y_j}_{z_j+x}) \neq h$ $(= M(t^{y_j}_{z_j}))$. Otherwise one would obtain $A_i = L_{y_i}$ and
$A_j = L_{y_j}$. Since $h = M(t^{y_i}_{z_i}) = M(t^{y_j}_{z_j})$, Claim 4 would imply $W_h = L_{y_i} = L_{y_j}$
and thus $A_i = A_j$. This is impossible since $(A_i)_{i\in\mathbb{N}}$ avoids repetitions. *)

- If Search (a) terminates first, let $B_i = T_i = \emptyset$. If Search (b) terminates first,
then execute Searches (b.1) and (b.2) until one of them terminates:

(b.1) Search for some $x$ with $M(t^{y_i}_{z_i+x}) \neq h$.

(b.2) Search for $w_{j_1}, \ldots, w_{j_s}$ with $w_j \in A_i \setminus B_j$ for all $j \in \{j_1, \ldots, j_s\}$.

(* Suppose $j \in \{j_1, \ldots, j_s\}$. Note that there is some $x$ with $M(t^{y_i}_{z_i+x}) \neq h$
or some $w \in A_i \setminus B_j$. Otherwise we would have $A_i \subseteq B_j$ and $M(t^{y_i}_{z_i+x}) = h$
for all $x \in \mathbb{N}$. This yields $A_i = L_{y_i}$ and, with Claim 4, $A_i = W_h$. Therefore
$W_h \subseteq B_j$. Claim 3 then implies $B_j \subseteq W_h$ and hence $A_i = W_h = B_j = A_j$,
which is a contradiction to the injectivity of the indexing $(A_i)_{i\in\mathbb{N}}$. *)

- If Search (b.1) terminates first, let $B_i = T_i = \emptyset$. If Search (b.2) terminates
first, then let $B_i = A_i$ and $T_i = content(t^{y_i}_{z_i}) \cup \{w_{j_1}, \ldots, w_{j_s}\}$.

This procedure yields a uniformly recursive indexing $(B_i)_{i\in\mathbb{N}}$ and a recursively
generable family $(T_i)_{i\in\mathbb{N}}$ of (possibly empty) finite sets with $T_i \subseteq B_i$ for all $i$.
The proof is left to the reader. Note that $T_i = \emptyset$ iff $B_i = \emptyset$. Moreover, $B_i = A_i$
iff $B_i \neq \emptyset$. In particular, it is uniformly decidable whether or not $B_i = \emptyset$.

Claim 5. $(B_i)_{i\in\mathbb{N}}$ comprises $C$.

Proof of Claim 5. Let $L \in C$. By Claim 2 there is some $i \in \mathbb{N}$ with $A_i = L_{y_i} = L$.
So, by definition, $M(t^{y_i}_{z_i+x}) = h$ for all $x \in \mathbb{N}$, where $h = M(t^{y_i}_{z_i})$. Thus either
$t^{y_j}_{z_j} \neq t^{y_i}_{z_i}$ for all $j < i$ or Searches (b) and (b.2) terminate first in the construction
of $B_i$. This yields $B_i = A_i = L$. Hence $(B_i)_{i\in\mathbb{N}}$ comprises $C$. qed Claim 5.

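Finally, define an indexing $(L'_i)_{i\in\mathbb{N}}$ by removing the empty language from
$(B_i)_{i\in\mathbb{N}}$. For $i$ and $j$ with $L'_i = B_j$ let $T'_i = T_j$. Thus $(T'_i)_{i\in\mathbb{N}}$ is a recursively
generable family of non-empty finite sets with $T'_i \subseteq L'_i$ for all $i$. It remains to
show that $C$, $(L'_i)_{i\in\mathbb{N}}$, and $(T'_i)_{i\in\mathbb{N}}$ fulfil the conditions of Theorem 1.2, i.e.,

(i) $(L'_i)_{i\in\mathbb{N}}$ comprises $C$,

(ii) $(T'_i)_{i\in\mathbb{N}}$ is a telltale family for $(L'_i)_{i\in\mathbb{N}}$.

ad (i). This is an immediate consequence of Claim 5 and the definition of $(L'_i)_{i\in\mathbb{N}}$.

ad (ii). To show that $T'_i \subseteq L'_j \subseteq L'_i$ implies $L'_j = L'_i$, suppose $T'_{i'} \subseteq L'_{j'} \subset L'_{i'}$
holds for some $i', j' \in \mathbb{N}$. Let $i, j \in \mathbb{N}$ with $L'_{i'} = B_i$ and $L'_{j'} = B_j$. This yields
$L'_{i'} = A_i$ and $L'_{j'} = A_j$. By the properties of canonical texts, one of the segments
$t^{y_i}_{z_i}$ and $t^{y_j}_{z_j}$ is an initial segment of the other.
Let $h_i = M(t^{y_i}_{z_i})$, $h_j = M(t^{y_j}_{z_j})$ and consider three cases.

Case 1. $z_j < z_i$. Then $t^{y_j}_{z_j}$ is a proper initial segment of $t^{y_i}_{z_i}$. In particular,
$t^{y_i}_{z_j} = t^{y_j}_{z_j}$ and $M(t^{y_i}_{z_j}) = h_j$. Note that $content(t^{y_i}_{z_i}) \subseteq T'_{i'} \subseteq L'_{j'} = A_j$. Moreover,
by Claim 3, $A_j \subseteq W_{h_j}$ and thus $content(t^{y_i}_{z_i}) \subseteq W_{h_j}$. Since $M$ is conservative on
any text for $L_{y_i} \supseteq A_i$, this yields $h_j = M(t^{y_i}_{z_j}) = M(t^{y_i}_{z_j+1}) = \cdots = M(t^{y_i}_{z_i})$. By
definition of the family $(k_i, n_i)_{i\in\mathbb{N}}$, we have $M(t^{y_i}_{z_i}) \neq M(t^{y_i}_{z_i-1})$. This results in
$z_i = z_j$ and thus in a contradiction.

Case 2. $z_i < z_j$. Then $t^{y_i}_{z_i}$ is a proper initial segment of $t^{y_j}_{z_j}$. In particular,
$t^{y_j}_{z_i} = t^{y_i}_{z_i}$ and $M(t^{y_j}_{z_i}) = h_i$. Note that $content(t^{y_j}_{z_j}) \subseteq L'_{j'} \subset L'_{i'} = A_i$. Moreover,
by Claim 3, $A_i \subseteq W_{h_i}$ and thus $content(t^{y_j}_{z_j}) \subseteq W_{h_i}$. Similarly as above, this
results in $z_i = z_j$ and thus in a contradiction.

Case 3. $z_i = z_j$. Then $t^{y_i}_{z_i} = t^{y_j}_{z_j}$. First, assume $i > j$. Since $B_i \neq \emptyset$, in the
construction of $B_i$ some $w \in A_i \setminus B_j$ is included in $T_i$. This yields $w \in T'_{i'} \subseteq L'_{j'}$.
So $w \in L'_{j'} \setminus B_j$ in contradiction to $L'_{j'} = B_j$. Second, assume $i < j$. Since
$B_j \neq \emptyset$, during the construction of $B_j$ some $w \in A_j \setminus B_i$ is found. This yields
$w \in L'_{j'} \setminus L'_{i'}$ in contradiction to $L'_{j'} \subset L'_{i'}$.

Since each case yields a contradiction, our assumption has been wrong, i.e.,
there are no indices $i', j' \in \mathbb{N}$ such that $T'_{i'} \subseteq L'_{j'} \subset L'_{i'}$. This finally proves (ii).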

By Theorem 1.2, the families $(L'_i)_{i\in\mathbb{N}}$ and $(T'_i)_{i\in\mathbb{N}}$ witness $C \in ConsvTxt_{rec}$.
Since $C \in Cons\text{-}ConsvTxt_{r.e.}$ was chosen arbitrarily, our argument directly proves
$Cons\text{-}ConsvTxt_{r.e.} \subseteq ConsvTxt_{rec}$ and so $ConsvTxt_{rec} = Cons\text{-}ConsvTxt_{r.e.}$. □

As it turns out, throughout the whole proof we never use the fact that the
hypothesis space for Cons-ConsvTxt-identification is uniformly r.e. This implies
the following corollary: Assume $(A_i)_{i\in\mathbb{N}}$ is any family of languages (not necessarily
uniformly r.e.) and $C$ is any indexable class of recursive languages.

- If $C$ is Cons-ConsvTxt-learnable wrt $(A_i)_{i\in\mathbb{N}}$, then $C \in ConsvTxt_{rec}$.

The following figure summarizes our main results.

This graph illustrates the relations between different inference types studied
above. If two inference types are contained in one box, they are equal. Each vector
indicates a proper inclusion of inference types, whereas missing links symbolize
incomparability. The dashed box around $ConsvTxt_{r.e.}$ is used to indicate that it
is not yet known whether or not $ConsvTxt_{r.e.}$ belongs to the adjacent box below.


Abstract. We investigate further improvement of boosting in the case
that the target concept belongs to the class of r-of-k threshold Boolean
functions, which answer “+1” if at least r of k relevant variables are positive,
and answer “-1” otherwise. Given m examples of an r-of-k function
and literals as base hypotheses, popular boosting algorithms (e.g., AdaBoost)
construct a consistent final hypothesis by using $O(k^2 \log m)$ base
hypotheses. While this convergence speed is tight in general, we show
that a modification of AdaBoost (confidence-rated AdaBoost [SS99] or
InfoBoost [Asl00]) can make use of the property of r-of-k functions that
they make fewer errors on one side to find a consistent final hypothesis by using
$O(kr \log m)$ hypotheses. Our result extends the previous investigation
by Hatano and Warmuth [HW04] and gives more general examples
where confidence-rated AdaBoost or InfoBoost has an advantage over
AdaBoost.



1 Introduction 

Boosting is a method of constructing strongly accurate classifiers by combining
weakly accurate base classifiers, and it is now one of the most fundamental techniques
in machine learning. A typical boosting algorithm proceeds in rounds as
follows: At each round, it assigns a distribution on the given data, and finds
a moderately accurate base classifier with respect to the distribution. Then it
updates the distribution so that the last chosen classifier is insignificant for it.
It has been shown that this algorithm converges relatively quickly under the assumption
that a base hypothesis with error less than $1/2 - \gamma$ can be found at each
iteration. For example, under this assumption, Freund [Fre95] gave a boosting
algorithm and proved that it yields an $\epsilon$-approximation of the correct hypothesis
by using $O((1/\gamma^2) \log(1/\epsilon))$ base hypotheses. Freund also proved that the result
is in a sense tight; that is, if the final hypothesis is a majority vote of weak hypotheses
(as in his boosting algorithm), then any boosting algorithm needs, in the
worst case, $\Omega((1/\gamma^2) \log(1/\epsilon))$ weak hypotheses to obtain an $\epsilon$-approximation.

While Freund's result is tight in general, it is still possible to improve boosting
for some particular cases. For example, Dasgupta and Long [DL03] showed that
the above boosting algorithm by Freund works more efficiently when the set
of base classifiers contains a subset of good “diverse” classifiers. This paper
studies one such case; here we consider the situation where errors occur less
often on one side.

* This work is supported in part by a Grant-in-Aid for Scientific Research on Priority
Areas “Statistical-Mechanical Approach to Probabilistic Information Processing”.

Let us see the following illustrative example. Consider learning k-disjunctions.
That is, given m labeled examples for a target k-disjunction f
over n Boolean variables, our task is to find a hypothesis consistent with f on
these examples. Let $\mathcal{H}$ be the set of very simple Boolean functions that are either
a function defined by one literal or a constant +1 or -1, and let $\mathcal{H}^+$ be
the subset of $\mathcal{H}$ consisting of functions defined by one Boolean variable, i.e., a
positive literal. By using functions in $\mathcal{H}$ as base classifiers, a popular boosting
algorithm (e.g., AdaBoost) can find a consistent hypothesis using $O(k^2 \log m)$
base hypotheses. In contrast, the greedy set-covering algorithm can accomplish
the same task with $O(k \log m)$ hypotheses, as learning of k-disjunctions
can be reduced to the set-covering problem [Nat91,KV94]. Note that for the
set-covering problem this bound cannot be improved by a constant factor efficiently
under a realistic complexity-theoretic assumption [Fei98]. Now it is natural to
ask whether boosting algorithms can beat the greedy set-covering algorithm in
this learning problem. Hatano and Warmuth [HW04] showed that a modification
of AdaBoost, called confidence-rated AdaBoost [SS99] or InfoBoost [Asl00]
(in the following, we simply call it InfoBoost), can achieve the $O(k \log m)$ bound.
Moreover, they showed that AdaBoost does need $\Omega(k^2 \log m)$ hypotheses under
some technical conditions¹.
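The reduction to set covering mentioned above can be sketched in a few lines. This is an illustrative sketch rather than the procedure of [Nat91,KV94]: it assumes the target is a monotone disjunction (positive literals only), examples are tuples over {-1, +1}, and the function name `greedy_disjunction` is hypothetical. A variable is usable only if it is -1 on every negative example; the greedy rule then repeatedly picks the usable variable that covers the most still-uncovered positive examples.

```python
def greedy_disjunction(sample):
    """Greedy set cover for a monotone disjunction consistent with `sample`.

    sample: list of (x, y) with x a tuple over {-1, +1} and y in {-1, +1}.
    Returns the list of chosen variable indices.
    """
    n = len(sample[0][0])
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    # Only variables that never fire on a negative example may be used.
    candidates = [i for i in range(n) if all(x[i] == -1 for x in negatives)]
    uncovered = set(range(len(positives)))
    chosen = []
    while uncovered:
        best, best_cover = None, set()
        for i in candidates:
            cover = {p for p in uncovered if positives[p][i] == +1}
            if len(cover) > len(best_cover):
                best, best_cover = i, cover
        if best is None:
            raise ValueError("sample is not consistent with a monotone disjunction")
        chosen.append(best)
        uncovered -= best_cover
    return chosen


# Usage: all 16 points over 4 variables, labelled by the disjunction of x0 and x2.
points = [(a, b, c, d) for a in (-1, 1) for b in (-1, 1)
          for c in (-1, 1) for d in (-1, 1)]
sample = [(x, +1 if x[0] == 1 or x[2] == 1 else -1) for x in points]
print(greedy_disjunction(sample))
```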

In the above case, we may assume that all base hypotheses in $\mathcal{H}^+$ have the
one-sided error property; that is, their positive predictions are always correct; in
other words, all these base hypotheses have 0 false positive error. InfoBoost is
designed so that it can make use of this situation, and this is why it performs
better than AdaBoost. In general, when this one-sided error property can be
assumed for good base hypotheses, InfoBoost can reduce the number of boosting
iterations from $O((1/\gamma^2) \log m)$ to $O((1/\gamma) \log m)$. Besides this extreme case, we
may naturally expect that InfoBoost can also make use of a “partially” one-sided
error property. This is the question we discuss in this paper. As a concrete
example of “partially” one-sided error situations, we consider here the problem
of learning r-of-k threshold functions. For any r and k, an r-of-k function (over
n Boolean variables) consists of k relevant variables, and it answers +1 if at
least r of the k variables are positive, and otherwise answers -1. We learn these
functions by using again $\mathcal{H}$ as the set of base classifiers. Note that k-disjunctions
are nothing but 1-of-k functions; hence, our learning task is a generalization
of learning k-disjunctions. Also for small r, we may expect that hypotheses in
$\mathcal{H}^+$ have on average small (but possibly nonzero) false positive error. We show
that InfoBoost can still make use of this one-sided error property and find a
consistent hypothesis by using $O(kr \log m)$ base hypotheses. Thus, the running
time of InfoBoost is from $O(k \log m)$ (when $r = O(1)$, e.g., k-disjunctions) to
$O(k^2 \log m)$ (when $r = \Omega(k)$, e.g., k-variable majority functions).

¹ A typical setting is that a boosting algorithm has a pool of base hypotheses and it
can choose hypotheses greedily from the pool. But in the setting of [HW04] it has
only access to a weak learner oracle, which is just guaranteed to return a hypothesis
with its error below a fixed threshold $1/2 - \gamma$.

Unfortunately, we know few related results on learning of r-of-k functions.
The online learning algorithm WINNOW [Lit88] (more precisely, WINNOW2)
can be used to learn r-of-k functions. With its fixed parameters $\alpha = 1 + 1/(2r)$
and $\theta = n$, WINNOW has a mistake bound of $O(kr \log n)$ for r-of-k functions.
(Note that n is the number of Boolean variables in the domain, not the number
of the examples m.) However, unlike confidence-rated AdaBoost or InfoBoost,
it needs the knowledge of r in advance.

Other related optimization problems might be covering integer programs
[Sri99,Sri01] and mixed covering integer programs [L01]. The former can be
viewed as a general form of the set-covering problem, and the latter is a further
generalization of the former, motivated by the problem of learning the minimum
majority of base hypotheses. Both problems can be approximately solved by randomized
polynomial time algorithms, whose solutions have size $O(d \cdot opt \log opt)$
and $O(d \cdot opt^2)$, respectively (here, opt denotes the size of the optimal solution
and d is called the pseudo-dimension; see [L01] for the details). Learning of r-of-k
functions may be characterized as a problem “intermediate” between the two.

Technically our contribution is to develop a way to show the existence of a base
hypothesis with reasonable advantage and small error probability on one side from the fact
that $\mathcal{H}$ has on average reasonable advantage and small false positive error.

2 Preliminaries 

In this section, we give basic definitions. Throughout this paper, the domain of
interest is $\mathcal{X} = \{-1,+1\}^n$. By a “Boolean function” we will mean a function
from $\mathcal{X}$ to $\{-1,+1\}$. We use $X_i$, $i = 1, \ldots, n$, to denote Boolean variables, which
naturally define functions from $\mathcal{X}$ to $\{-1,+1\}$. Let $F^r_k$ be the set of r-of-k functions,
i.e., Boolean functions on k Boolean variables that return +1 if at least
r of these k variables are assigned +1. For any $f \in F^r_k$, Rel(f) is the set of the
relevant Boolean variables of f.
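As a concrete reading of this definition, an r-of-k function can be written as a small closure. The sketch below assumes that points are tuples over {-1, +1} and that the relevant variables are given by their (0-based) indices; the name `r_of_k` is introduced here for illustration only.

```python
def r_of_k(r, relevant):
    """Return the r-of-k function whose relevant variables have indices `relevant`."""
    relevant = tuple(relevant)

    def f(x):
        # Count the relevant variables that are assigned +1.
        positive = sum(1 for i in relevant if x[i] == +1)
        return +1 if positive >= r else -1

    return f


majority = r_of_k(2, [0, 1, 2])      # 2-of-3: majority of three variables
disjunction = r_of_k(1, [0, 1, 2])   # 1-of-3: a 3-disjunction
print(majority((+1, +1, -1, -1)), majority((+1, -1, -1, +1)))
print(disjunction((-1, -1, +1, -1)), disjunction((-1, -1, -1, +1)))
```

With r = 1 the function is a k-disjunction and with r = k a k-conjunction, which is why the irrelevant last coordinate never changes the outputs above.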

For any $f \in F^r_k$, a pair $(x, f(x))$ is called an example of f. A set of such
pairs is called a sample. We will use $\mathcal{H} = \{X_1, \ldots, X_n, +1, -1\}$ as the set of base
hypotheses (or weak hypotheses). For any distribution D, any Boolean function
f, and any hypothesis $h \in \mathcal{H}$, we define

\begin{align*}
\mathrm{err}^{+|}_D(f,h) &= \Pr_D[f(x) = +1 \mid h(x) = -1],\\
\mathrm{err}^{+}_D(f,h) &= \Pr_D[f(x) = +1,\ h(x) = -1],\\
\mathrm{err}^{-|}_D(f,h) &= \Pr_D[f(x) = -1 \mid h(x) = +1],\\
\mathrm{err}^{-}_D(f,h) &= \Pr_D[f(x) = -1,\ h(x) = +1], \text{ and}\\
\mathrm{error}_D(f,h) &= \mathrm{err}^{+}_D(f,h) + \mathrm{err}^{-}_D(f,h).
\end{align*}

We will omit D and f in the notation when it is clear from the context.

Given: A set S of m examples $\{(x,y) \mid x \in \mathcal{X},\ y = f(x)\}$.

We denote elements of S as $(x_1,y_1), \ldots, (x_m,y_m)$.

begin

1. Let $D_1(i) = 1/m$, for $i = 1, \ldots, m$. (I.e., $D_1(i) = D_0(x_i)$.)

2. For $t = 1$ to $T$, repeat the following procedures:

a) Select $h_t \in \mathcal{H}$ that minimizes $Z_t$ defined below (see (1)).

b) Let $\alpha_t(+1) = \frac{1}{2} \ln \frac{1 - \mathrm{err}^{-|}_{D_t}(h_t)}{\mathrm{err}^{-|}_{D_t}(h_t)}$ and $\alpha_t(-1) = \frac{1}{2} \ln \frac{1 - \mathrm{err}^{+|}_{D_t}(h_t)}{\mathrm{err}^{+|}_{D_t}(h_t)}$.

c) Define the next distribution $D_{t+1}$ as follows for each $i = 1, \ldots, m$:

\[
D_{t+1}(i) = \frac{D_t(i) \exp\{-y_i \cdot \alpha_t(h_t(x_i)) \cdot h_t(x_i)\}}{Z_t}
\]

($Z_t$ defined by (1) is the normalization factor.)

3. Output the final hypothesis $f_{out}$ defined by

\[
f_{out}(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t(h_t(x)) \cdot h_t(x) \right).
\]

Fig. 1. A Boosting Algorithm
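The scheme of Fig. 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the confidences are $\alpha_t(+1) = \frac{1}{2}\ln(W_{+,+}/W_{+,-})$ and $\alpha_t(-1) = \frac{1}{2}\ln(W_{-,-}/W_{-,+})$, where $W_{b,c}$ is the weight of examples with prediction b and label c (the usual InfoBoost choice, which minimizes the normalization factor), and it adds a small smoothing constant `eps` so that a zero one-sided error does not produce an infinite confidence. The function name and the smoothing are assumptions of this sketch.

```python
import math


def infoboost(sample, hypotheses, rounds, eps=1e-12):
    """Run `rounds` iterations of a Fig. 1 style boosting scheme.

    sample:     list of (x, y) pairs with y in {-1, +1}
    hypotheses: list of functions x -> {-1, +1}
    Returns (predict, zs); zs[t] is the normalization factor of round t+1.
    """
    m = len(sample)
    dist = [1.0 / m] * m
    ensemble, zs = [], []

    def confidences(h):
        # w[(b, c)]: weight under the current distribution of examples
        # with h(x) = b and label c.
        w = {(b, c): 0.0 for b in (+1, -1) for c in (+1, -1)}
        for d, (x, y) in zip(dist, sample):
            w[(h(x), y)] += d
        # eps (an assumption of this sketch) keeps the logarithms finite.
        a_plus = 0.5 * math.log((w[(+1, +1)] + eps) / (w[(+1, -1)] + eps))
        a_minus = 0.5 * math.log((w[(-1, -1)] + eps) / (w[(-1, +1)] + eps))
        return a_plus, a_minus

    def reweighted(h, a_plus, a_minus):
        # Unnormalized weights D_t(i) * exp(-y_i * alpha_t(h(x_i)) * h(x_i)).
        return [d * math.exp(-y * (a_plus if h(x) == +1 else a_minus) * h(x))
                for d, (x, y) in zip(dist, sample)]

    for _ in range(rounds):
        best = None
        for h in hypotheses:
            a_plus, a_minus = confidences(h)
            weights = reweighted(h, a_plus, a_minus)
            z = sum(weights)                 # normalization factor for h
            if best is None or z < best[0]:
                best = (z, h, a_plus, a_minus, weights)
        z, h, a_plus, a_minus, weights = best
        dist = [v / z for v in weights]      # next distribution
        ensemble.append((h, a_plus, a_minus))
        zs.append(z)

    def predict(x):
        score = sum((a_p if h(x) == +1 else a_m) * h(x)
                    for h, a_p, a_m in ensemble)
        return +1 if score >= 0 else -1

    return predict, zs
```

A typical use builds the pool from the n variables and the two constants, for example `[(lambda x, i=i: x[i]) for i in range(n)] + [lambda x: +1, lambda x: -1]`, and compares the training error of `predict` with the product of the returned factors.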



For our learning algorithm, we will consider InfoBoost [Asl00] (or confidence-rated
AdaBoost [SS99]) with $\mathcal{H}$ as the set of weak hypotheses; Figure 2 shows
the description of this boosting algorithm on a given sample $S = \{(x,y) \mid x \in \mathcal{X},\
y = f(x)\}$ of m examples for some target concept f. Recall that the goal of
this algorithm is to obtain a (combined) hypothesis $f_{out}$ that is consistent with
f on S, which we will simply call a consistent hypothesis. More precisely, for any
$x \in \mathcal{X}$, define $D_0(x) = 1/m$ if x appears in S and $D_0(x) = 0$ otherwise. Then the
consistent hypothesis, which we want to get, is $f_{out}$ such that $\mathrm{error}_{D_0}(f, f_{out}) = 0$.
For this goal, the following boosting property is known [SS99, Asl00].

Theorem 1. For any $T \ge 1$, let $f_{out}$ be the hypothesis that the boosting algorithm
yields after T iterations. Then the following bound holds for the training
error of this hypothesis.

\[
\mathrm{error}_{D_0}(f, f_{out}) \le \prod_{t=1}^{T} Z_t.
\]

Here $Z_t$ is the normalization factor of the t-th iteration, which is defined as
follows.

\[
Z_t = \sum_{1 \le i \le m} D_t(i) \exp\{-y_i\, \alpha_t(h_t(x_i))\, h_t(x_i)\}. \tag{1}
\]

3 Main Result 



Theorem 1 gives the convergence speed of the boosting algorithm of Figure 2.
In our setting, it is easy to bound $Z_t$ by $\sqrt{1 - 1/O(k^2)}$ for any r-of-k target
function, from which it is shown that the algorithm yields a consistent hypothesis
in $O(k^2 \log m)$ steps. Our technical contribution is to improve this bound to
$\sqrt{1 - 1/O(rk)}$, thereby improving the upper bound for the boosting steps to
$O(kr \log m)$.

In the following, for any $r \ge 1$ and $k \ge 1$, consider any r-of-k function f for
our target concept, let it be fixed, and we will estimate the normalization factor
$Z_t$ of each step $t \ge 1$. Our goal is to prove the following bound. (For $Z_1$, by
definition we have $Z_1 \le 1$.)

Theorem 2. For any $t \ge 2$, we have

\[
Z_t \le \sqrt{1 - \frac{1}{512\,rk}}.
\]
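The step from a per-round bound on $Z_t$ to the number of boosting rounds can be spelled out as follows. This is a sketch of the standard calculation, assuming a bound of the form $Z_t \le \sqrt{1 - 1/(c\,rk)}$ for all $t \ge 2$ with some constant $c > 0$ (the symbol $c$ is introduced here only for illustration), together with $Z_1 \le 1$ and the inequality $1 - x \le e^{-x}$:

```latex
\begin{align*}
\mathrm{error}_{D_0}(f, f_{out})
  \;\le\; \prod_{t=1}^{T} Z_t
  \;\le\; \Bigl(1 - \frac{1}{c\,rk}\Bigr)^{(T-1)/2}
  \;\le\; \exp\Bigl(-\frac{T-1}{2c\,rk}\Bigr).
\end{align*}
```

The right-hand side is below $1/m$ as soon as $T > 1 + 2c\,rk \ln m$. Since $D_0$ puts weight $1/m$ on each example of the sample, a training error below $1/m$ must be 0, so $T = O(kr \log m)$ rounds suffice under this assumption.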

We first show some key properties used to derive this bound. In the following,
let any $t \ge 2$ be fixed, and let us simply denote $Z_t$ as Z. Precisely speaking,
the distribution $D_t$ of the t-th iteration is defined on $i = 1, \ldots, m$; but we may
regard it as a distribution on $\mathcal{X}$ such that $D_t(x) = D_t(i)$ if $x = x_i$ for some
$i = 1, \ldots, m$, and $D_t(x) = 0$ otherwise.
Let D denote this distribution. Then the following fact is checked easily.

Fact 1. For our distribution D, we have $\Pr_D[f(x) = +1] = \Pr_D[f(x) = -1] = 1/2$.

We say that such a distribution D is balanced (with respect to f).

Using this property and the assumption that f is an r-of-k function, the following
weak hypothesis can be guaranteed. In the following, let $\mathcal{H}'$ be the non-constant
hypotheses of $\mathcal{H}$; that is, $\mathcal{H}' = \{X_1, \ldots, X_n\}$. Also when $h = X_i$, we
simply write $\mathrm{err}^+_i$ and $\mathrm{err}^-_i$ for $\mathrm{err}^+_D(f,h)$ and $\mathrm{err}^-_D(f,h)$ respectively.



Lemma 1. There exists a hypothesis $h \in \mathcal{H}'$ (in fact, some $h \in \mathrm{Rel}(f)$) with

\[
\mathrm{error}(h) \le \frac{1}{2} - \frac{1}{2k}.
\]

Proof. Consider the total number of mistakes made by hypotheses $X_i$ in Rel(f).
For any positive instance, the total number of mistakes is at most $k - r$. Hence,
we have $\sum_{X_i \in \mathrm{Rel}(f)} \mathrm{err}^+_i = \sum_{X_i \in \mathrm{Rel}(f)} \Pr_D\{X_i(x) = -1,\ f(x) = 1\} \le (k - r) \cdot
\Pr_D\{f(x) = 1\} = (k - r)/2$, because D is balanced. By a similar argument for
negative instances, we have $\sum_{X_i \in \mathrm{Rel}(f)} \mathrm{err}^-_i \le (r - 1)/2$. In total, we have

\[
\sum_{X_i \in \mathrm{Rel}(f)} \mathrm{error}(X_i) \le \frac{k - r}{2} + \frac{r - 1}{2} = \frac{k - 1}{2}.
\]

On the other hand, Rel(f) has k variables; therefore, by the averaging argument,
there should be at least one $X_i$ satisfying the lemma. □
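The averaging argument can be checked numerically on small instances. The sketch below is only an illustration under assumptions of its own: the relevant variables are taken to be the first k coordinates, the balanced distribution is drawn at random over all points of the domain, and the name `lemma1_check` is hypothetical. It requires $1 \le r \le k \le n$ so that both labels occur.

```python
import itertools
import random


def lemma1_check(r, k, n, seed=0):
    """Return (smallest error of a relevant variable, the bound 1/2 - 1/(2k))
    for a random balanced distribution and the r-of-k target on variables 0..k-1."""
    rng = random.Random(seed)
    points = list(itertools.product((-1, +1), repeat=n))

    def label(x):
        return +1 if sum(1 for i in range(k) if x[i] == +1) >= r else -1

    positives = [x for x in points if label(x) == +1]
    negatives = [x for x in points if label(x) == -1]
    dist = {}
    # Balanced: total weight 1/2 on positive and 1/2 on negative points.
    for group in (positives, negatives):
        raw = [rng.random() + 1e-9 for _ in group]
        total = sum(raw)
        for x, w in zip(group, raw):
            dist[x] = 0.5 * w / total
    # error(X_i) = probability that variable i disagrees with the label.
    errors = [sum(p for x, p in dist.items() if x[i] != label(x))
              for i in range(k)]
    return min(errors), 0.5 - 1.0 / (2 * k)


best, bound = lemma1_check(r=2, k=3, n=4)
print(best <= bound)
```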



This bound is somewhat standard. In fact, using this bound, one can bound
Z by $\sqrt{1 - 1/(2k)^2}$. While this bound cannot be improved in general, we may
be able to have better bounds if the error is estimated separately for positive
and negative instances, which is the advantage of using InfoBoost. For example,
if the target function f is the disjunction of some k variables (i.e., r = 1), then
any $h \in \mathrm{Rel}(f)$ does not make a mistake on negative instances; thus, it follows
from the above that there exists $h \in \mathcal{H}'$ whose general error probability is bounded by
$1/2 - 1/(2k)$ and whose error probability on negative instances (i.e., $\mathrm{err}^-(h)$)
is 0. The bound $O(k \log m)$ [HW04] is derived from this observation. Here we
would like to generalize this argument. In the case of k-disjunctions (or 1-of-k
functions), it is guaranteed that base classifiers in $\mathcal{H}'$ have good advantage on
average and all of them have 0 false positive error. But now the assumption we
can use is that they have good advantage and small false positive error both on
average. Our technical contribution is to derive a similar consequence from this
assumption.


Lemma 2. There exists a hypothesis $h \in \mathcal{H}'$ (in fact, $h \in \mathrm{Rel}(f)$) such that the following holds for some integer $a$, $1 \le a \le 4k$, with constant $c_0 = 2$:

\[ \mathrm{error}(h) \le \frac{1}{2} - \frac{a}{8k} \tag{2} \]

and

\[ \mathrm{err}^-(h) \le \frac{c_0 (r - 1) a^2}{k}. \tag{3} \]



Proof. Consider $k$ hypotheses $h_1, \ldots, h_k$ in $\mathrm{Rel}(f)$; that is, each $h_i$ is a Boolean variable relevant to the target function $f$. Consider first the case that there are more than $k/2$ "relatively good" $h_i$ in $\mathrm{Rel}(f)$ that satisfy $\mathrm{error}(h_i) \le 1/2 - 1/(4k)$. As discussed in the proof of Lemma 1, we have $\sum_{h_i \in \mathrm{Rel}(f)} \mathrm{err}^-(h_i) \le (r - 1)/2$. Thus, among such relatively good $h_i$, there must be some $h_i$ such that $\mathrm{err}^-(h_i) \le ((r - 1)/2) \div (k/2) = (r - 1)/k$. This clearly satisfies the lemma with $a = 1$.

Consider next the case that we do not have so many relatively good hypotheses. For each $h_i$, let $w_i = (1/2 - 1/(2k)) - \mathrm{error}(h_i) \le 1/2$. Here note that some $h_i$ may have negative $w_i$; for example, a hypothesis $h_i$ that is not "relatively good" has weight $w_i < -1/(4k)$. Also note that the total, i.e., $\sum_i w_i$, must be at least 0, because $\mathrm{error}(h_i)$ is at most $1/2 - 1/(2k)$ on average (see the proof of Lemma 1).

Consider the set $H$ of hypotheses $h_i$ that have nonnegative weights, and let $s$ be the number of such hypotheses. Since more than $k/2$ hypotheses are not "relatively good", the total weight of all negative $w_i$'s is less than $-(1/(4k))(k/2) = -1/8$. Thus, letting $\sigma = \sum_{h_i \in H} w_i$, we have $\sigma > 1/8$. Now we use Lemma A in the Appendix with these $w_i$'s, $s$, $\sigma$, and $t = 8k$. Then from the lemma, we have either (i) $s > k/2$, or (ii) there exists some $a \le 4k$ for which the following holds:

\[ \left| \{ h_i \in H : w_i \ge a/(8k) \} \right| > \frac{k}{4a^2}. \]

Note, however, that case (i) cannot happen because we assume that there are not so many good hypotheses. Hence consider case (ii). For the parameter $a$ for which the above bound holds, consider the hypotheses $h_i$ such that $w_i \ge a/(8k)$. Let $H'$ be the set of such hypotheses. Clearly, any $h_i$ in $H'$ satisfies (2). Again, since $H'$ has more than $k/(4a^2)$ hypotheses and the sum of their $\mathrm{err}^-$'s, i.e., $\sum_{h_i \in H'} \mathrm{err}^-(h_i)$, is still at most $(r - 1)/2$, there must be some hypothesis in $H'$ such that $\mathrm{err}^- \le 2a^2(r - 1)/k$, satisfying the lemma. □



Now we are ready to prove our theorem. 

Proof of Theorem 2. Let err“, err|“, err"*", err|“'", and err denote respectively 
err^(/i), err|^(/i), err^(/i), err|^(/i) and error for the hypothesis h selected 
at the t-th iteration under the distribution D (= Dt). First we restate (1) as 
follows (see, e.g., [Asl00]).



Z = 2 ^ D{f)^ — err| )err| + E — err|+)err| + 

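By using the fact that $D$ is balanced, it is easy to restate it further as

\[ Z = 2\sqrt{\left(\tfrac{1}{2} - \mathrm{err}^+\right)\mathrm{err}^-} + 2\sqrt{\left(\tfrac{1}{2} - \mathrm{err}^-\right)\mathrm{err}^+}. \]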



We consider two cases.

(Case 1: $\mathcal{H}$ has a quite good hypothesis)

Suppose that $\mathcal{H}$ has a good hypothesis so that we have $\mathrm{err} \le 1/2 - 1/(16\sqrt{rk})$ for the selected $h$. Recall that $\mathrm{err} = \mathrm{err}^- + \mathrm{err}^+$; then it is easy to see that, for any fixed value of $\mathrm{err}$, $Z$ takes the maximum when $\mathrm{err}^- = \mathrm{err}^+ = \mathrm{err}/2$. Thus, considering this situation, we have

\[ Z \le 2\sqrt{\mathrm{err}\,(1 - \mathrm{err})} \le \sqrt{1 - \frac{1}{64rk}}. \]



(Case 2: $\mathcal{H}$ does not have such a good hypothesis)

That is, suppose that $\mathcal{H}$ has no hypothesis whose error is less than $1/2 - 1/(16\sqrt{rk})$. Together with this assumption, it follows from Lemma 2 that we have some hypothesis $h \in \mathcal{H}$ satisfying the conditions (2) and (3) for some integer $a$, $1 \le a \le (1/2)\sqrt{k/r}$. Then by Lemma B given in the Appendix, an upper bound on $Z$ is obtained by assuming the equalities in the conditions (2) and (3). Note that the constraints of Lemma B are satisfied for our choice of $a$. Thus, for any $r > 1$, we have^



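\begin{align*}
Z &= 2\sqrt{\left(\tfrac{1}{2} - \mathrm{err}^-\right)\mathrm{err}^+} + 2\sqrt{\mathrm{err}^-\left(\tfrac{1}{2} - \mathrm{err}^+\right)} \\
  &= 2\sqrt{\left( \sqrt{\left(\tfrac{1}{2} - \mathrm{err}^-\right)\mathrm{err}^+} + \sqrt{\mathrm{err}^-\left(\tfrac{1}{2} - \mathrm{err}^+\right)} \right)^{2}} \\
  &= 2\sqrt{ \frac{\mathrm{err}^- + \mathrm{err}^+}{2} - 2\,\mathrm{err}^-\mathrm{err}^+ + 2\sqrt{\mathrm{err}^-\mathrm{err}^+\left(\tfrac{1}{2} - \mathrm{err}^-\right)\left(\tfrac{1}{2} - \mathrm{err}^+\right)} } \\
  &= 2\sqrt{ \frac{\mathrm{error}}{2} - 2\,\mathrm{err}^-\mathrm{err}^+ + 2\sqrt{\mathrm{err}^-\mathrm{err}^+\left(\frac{1}{4} - \frac{\mathrm{err}^- + \mathrm{err}^+}{2} + \mathrm{err}^-\mathrm{err}^+\right)} } \\
  &\le \sqrt{ 1 - \frac{a}{4k} - 8\,\mathrm{err}^-\mathrm{err}^+ + 8\sqrt{\mathrm{err}^-\mathrm{err}^+\left(\frac{a}{16k} + \mathrm{err}^-\mathrm{err}^+\right)} } \\
  &= \sqrt{ 1 - \frac{a}{4k} - 8\,\mathrm{err}^-\mathrm{err}^+ + 8\,\mathrm{err}^-\mathrm{err}^+\sqrt{\frac{a}{16k\,\mathrm{err}^-\mathrm{err}^+} + 1} }.
\end{align*}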



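Since, by the Taylor series, $\sqrt{1 + x} \le 1 + \frac{x}{2} - \frac{x^2}{16}$ holds for $0 \le x \le 1$, the quantity $Z$ is bounded by the square root of

\[ 1 - \frac{a}{4k} - 8\,\mathrm{err}^-\mathrm{err}^+ + 8\,\mathrm{err}^-\mathrm{err}^+\left( 1 + \frac{1}{2}\cdot\frac{a}{16k\,\mathrm{err}^-\mathrm{err}^+} - \frac{1}{16}\left(\frac{a}{16k\,\mathrm{err}^-\mathrm{err}^+}\right)^{2} \right). \]

By simplifying this, we have

\[ Z \le \sqrt{1 - \frac{a^2}{512\,k^2\,\mathrm{err}^-\mathrm{err}^+}}. \]

Then by substituting $\mathrm{err}^+ = \frac{1}{2} - \frac{a}{8k} - \frac{2a^2(r - 1)}{k} = \frac{4k - a - 16a^2(r - 1)}{8k}$ and $\mathrm{err}^- = \frac{2a^2(r - 1)}{k}$, we obtain

\[ Z \le \sqrt{1 - \frac{4}{512\,(r - 1)\,(4k - a - 16(r - 1)a^2)}} \le \sqrt{1 - \frac{1}{512kr}}. \]

□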
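The Case 2 chain of inequalities can be checked numerically. The following is only a sketch under the reading of the proof used here: it assumes a balanced distribution, takes equality in conditions (2) and (3) with $c_0 = 2$, and compares the exact value of $Z$ with the intermediate bound $\sqrt{1 - a^2/(512k^2\,\mathrm{err}^-\mathrm{err}^+)}$ and the final bound $\sqrt{1 - 1/(512kr)}$. The function names are illustrative and do not come from the paper.

```python
import math

def infoboost_z(err_pos, err_neg):
    """Z of a hypothesis under a balanced distribution.

    err_pos = Pr[h(x) = -1, f(x) = +1], err_neg = Pr[h(x) = +1, f(x) = -1].
    """
    return (2 * math.sqrt((0.5 - err_pos) * err_neg)
            + 2 * math.sqrt((0.5 - err_neg) * err_pos))

def case2_errors(k, r, a):
    """Errors obtained by assuming equality in conditions (2) and (3)."""
    err_neg = 2 * a * a * (r - 1) / k
    err_pos = 0.5 - a / (8 * k) - err_neg
    return err_pos, err_neg

def check_case2(k, r):
    """Return (a, Z, intermediate bound, final bound) for each admissible integer a."""
    rows = []
    a_max = math.floor(0.5 * math.sqrt(k / r))  # 1 <= a <= (1/2) sqrt(k/r)
    for a in range(1, a_max + 1):
        err_pos, err_neg = case2_errors(k, r, a)
        z = infoboost_z(err_pos, err_neg)
        mid = math.sqrt(1 - a * a / (512 * k * k * err_pos * err_neg))
        final = math.sqrt(1 - 1 / (512 * k * r))
        rows.append((a, z, mid, final))
    return rows
```

For instance, `check_case2(16, 2)` has a single admissible value $a = 1$; in every row the exact $Z$ should not exceed the intermediate bound, which in turn should not exceed the final bound.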

Finally we state the following corollary that is derived from Theorem 1 and 
Theorem 2 by a standard analysis. 

Corollary 1. For any $r$ and $k$, $1 \le r \le k/2$ and $2 \le k \le n$, and for any target $r$-of-$k$ function $f$, the boosting algorithm given in Figure 2 yields a hypothesis that is consistent with $f$ if $T \ge 1024\,kr \log m + 1$.

^ Here we explain the case $r > 1$. The case $r = 1$ can be treated by a much simpler analysis, which can be found in [HW04].

4 Concluding Remarks

We proved that for $r$-of-$k$ functions, InfoBoost can find consistent hypotheses in $O(rk \log m)$ iterations (where $m$ is the number of examples). On the other hand, AdaBoost seems to need $O(k^2 \log m)$ iterations (which has indeed been proved conditionally in [HW04]); thus, our result indicates that InfoBoost performs better than AdaBoost when we may assume less error on one classification side. A problem of theoretical interest is to show the tightness of our analysis. (We conjecture that our upper bound is tight except for the constant factor.)

Another interesting open problem on InfoBoost, which is also important in practice, is its generalization error. We are still unaware of a way to bound the generalization error of InfoBoost. Recall that Schapire et al. [SFBL98] showed that the "margin" can be used to estimate the generalization error of a hypothesis defined as a linear combination of hypotheses. While this analysis is applicable to hypotheses produced by InfoBoost, it may be the case that InfoBoost produces a hypothesis with a "small margin", which is not enough to guarantee a small generalization error (see [SS99] for details). One way to avoid arguments on generalization error is to introduce the boosting by filtering framework [Sch91, Fre95]. In this framework, instead of being given a set of examples in advance, the learner can directly draw examples randomly according to a fixed unknown distribution. Several researchers pointed out that AdaBoost does not fit this scheme because it constructs too "skew" distributions from which it is hard to sample examples [DW00, BG01, Ser01, Gav03]. The same applies to InfoBoost. They proposed variants of AdaBoost that create "smooth" distributions enabling us to sample efficiently. One of the interesting problems is to extend some of these ideas to InfoBoost.

Appendix

Lemma A. Let $\{w_1, \ldots, w_s\}$ be a set of weights such that $0 \le w_i \le 1/2$ for all $i$, $1 \le i \le s$, and $\sum_i w_i = \sigma$. Then for any $t > 0$, either $s \ge \sigma t/2$, or there exists some $a$ ($1 \le a \le t/2$) such that the following holds:

\[ \left| \{ w_i : w_i \ge a/t \} \right| \ge \frac{d_0\,\sigma t}{a^2}, \]

where we may set $d_0 = 1/4$.

Remark. Note that the average weight is $(\sum_i w_i)/s = \sigma/s$. Then the Markov inequality claims that $|\{w_i : w_i \ge a/t\}| = |\{w_i : w_i \ge \alpha(\sigma/s)\}| \le s/\alpha = \sigma t/a$ for all $a > 0$, where $\alpha = sa/(\sigma t)$. Thus, the lemma is regarded as a lower bound version of the Markov inequality. We can improve this bound by replacing the denominator with a smaller one, e.g., $a(\log a)^2$; but on the other hand, it is also easy to see that we cannot use the denominator $a$ as in the Markov inequality.

Proof. Assume that the lemma does not hold (i.e., neither $s \ge \sigma t/2$ holds nor some $a$ exists satisfying the lemma); we will derive a contradiction. For any integer $j \ge 0$, define $n(j) = d_0 \sigma t / 2^{2j}$. We may assume that $w_1 \le w_2 \le \cdots \le w_s$.

Then we have $w_i < 1/t$ for the first $s_0 \ge s - n(0)$ weights, because otherwise we would have more than $n(0)$ weights that are at least $1/t$, contradicting our assumption. Similarly, we have $w_i < 2/t$ for the next $s_1 \ge s - s_0 - n(1)$ weights, $w_i < 2^2/t$ for the next $s_2 \ge s - s_0 - s_1 - n(2)$ weights, and so on. It is easy to see that $\sum_i w_i$ becomes largest if $s_0 = s - n(0)$, $s_1 = s - s_0 - n(1)$, and so on. Thus, we estimate $\sum_i w_i$ with $s_0 = s - n(0)$, $s_1 = s - s_0 - n(1) = s - (s - n(0)) - n(1) = n(0) - n(1)$, $s_2 = s - s_0 - s_1 - n(2) = n(1) - n(2)$, and so on. Then we have

\begin{align*}
\sum_i w_i &\le \frac{1}{t}\,(s - n(0)) + \frac{2}{t}\,(n(0) - n(1)) + \frac{2^2}{t}\,(n(1) - n(2)) + \frac{2^3}{t}\,(n(2) - n(3)) + \cdots \\
 &= \frac{1}{t}\left( s + n(0) + 2n(1) + 2^2 n(2) + 2^3 n(3) + \cdots \right) \\
 &= \frac{s}{t} + d_0\sigma\left( 1 + \frac{2}{2^2} + \frac{2^2}{2^4} + \frac{2^3}{2^6} + \cdots \right) \\
 &= \frac{s}{t} + d_0\sigma\left( 1 + \frac{1}{2} + \frac{1}{4} + \cdots \right),
\end{align*}

which is, from our assumption on $s$ (namely $s/t < \sigma/2$) and the choice of $d_0$, less than $\sigma$. A contradiction. □
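Lemma A can be exercised on concrete weights. The sketch below assumes the reading of the lemma reconstructed from its proof (weights in $[0, 1/2]$, $\sigma = \sum_i w_i$, $d_0 = 1/4$, integer $a$ between 1 and $t/2$); `lemma_a_witness` is a hypothetical helper name, not something defined in the paper.

```python
D0 = 0.25  # the constant d_0 suggested in Lemma A

def lemma_a_witness(weights, t):
    """Search for a witness of Lemma A.

    Returns ("size", None) if s >= sigma * t / 2, or ("heavy", a) for the
    smallest integer a in [1, t/2] with
    |{i : w_i >= a / t}| >= D0 * sigma * t / a**2.
    Returns None if neither alternative holds, which the lemma rules out.
    """
    s = len(weights)
    sigma = sum(weights)
    if s >= sigma * t / 2:
        return ("size", None)
    for a in range(1, int(t // 2) + 1):
        heavy = sum(1 for w in weights if w >= a / t)
        if heavy >= D0 * sigma * t / (a * a):
            return ("heavy", a)
    return None
```

In the setting of Lemma 2 one would call it with $t = 8k$ and the nonnegative weights $w_i$; a returned pair `("heavy", a)` then names a value of $a$ for which many hypotheses have $w_i \ge a/(8k)$.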



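Lemma B. Consider the following optimization problem.

Maximize $V(b, c) = \sqrt{b\left(\tfrac{1}{2} - c\right)} + \sqrt{\left(\tfrac{1}{2} - b\right)c}$
subject to $b + c \le \frac{1}{2} - \gamma$, $\quad b, c \ge 0$, $\quad c \le \frac{1}{4} - \frac{\gamma}{2} - \delta$,

for given $\gamma$ and $\delta$ such that $\gamma > 0$ and $0 \le \delta \le (1/2 - \gamma)/2 = 1/4 - \gamma/2$. Then the maximum is obtained if $b + c = 1/2 - \gamma$ and $c = 1/4 - \gamma/2 - \delta$.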

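Proof. For any $\beta$, $0 \le \beta \le 1/2 - \gamma$, define $Z(\beta)$ by

\[ Z(\beta) = \max_{b,c} V(b, c) = \max_{b,c} \left[ \sqrt{b\left(\tfrac{1}{2} - c\right)} + \sqrt{\left(\tfrac{1}{2} - b\right)c} \right], \]

where $b, c \ge 0$ are chosen so that $b + c = \beta$ and $c \le 1/4 - \gamma/2 - \delta$. We consider the following two cases.

(Case 1: $\beta \le 1/2 - \gamma - 2\delta$)

In this case, since $\beta/2 \le 1/4 - \gamma/2 - \delta$, it is possible to choose $b = c = \beta/2$, which maximizes $V(b, c)$, giving $V(b, c) = \sqrt{(1 - \beta)\beta}$. Hence, $Z(\beta) = \sqrt{(1 - \beta)\beta}$. Since this $Z(\beta)$ is monotone increasing, its maximum $Z_1$ is obtained with $\beta = 1/2 - \gamma - 2\delta$, and we have

\[ Z_1 = \sqrt{\frac{1}{4} - (\gamma + 2\delta)^2}. \]

(Case 2: $\beta \ge 1/2 - \gamma - 2\delta$)

Since $V(\beta - c, c)$ is monotone increasing for $c \le 1/4 - \gamma/2 - \delta$, it is maximized when $c = 1/4 - \gamma/2 - \delta$; hence, we have

\[ Z(\beta) = \sqrt{(\beta - c)\left(\tfrac{1}{2} - c\right)} + \sqrt{\left(\tfrac{1}{2} - \beta + c\right)c} \quad \text{with } c = \tfrac{1}{4} - \tfrac{\gamma}{2} - \delta. \]

Again, since this $Z(\beta)$ is monotone increasing, we have the maximum

\[ Z_2 = \sqrt{\left(\tfrac{1}{4} + \delta\right)^2 - \frac{\gamma^2}{4}} + \sqrt{\left(\tfrac{1}{4} - \delta\right)^2 - \frac{\gamma^2}{4}} \]

when $\beta = 1/2 - \gamma$ (i.e., $b + c = 1/2 - \gamma$ and $c = 1/4 - \gamma/2 - \delta$).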

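Now we show that $Z_1 \le Z_2$ for any $\delta$ such that $0 \le \delta \le 1/4 - \gamma/2$. Let $g(x) = \sqrt{x}$ and $A = 1/16 - (\delta + \gamma/2)^2$. Then we have

\[ Z_2 - Z_1 = g(A + p) + g(A - q) - 2g(A), \tag{4} \]

where we let $p = 2\delta^2 + \delta\gamma + \delta/2$ and $q = -2\delta^2 - \delta\gamma + \delta/2$. Note that $p, q \ge 0$ by the definition of $\delta$ and $\gamma$. Since the function $g(x)$ is concave and monotone increasing, it follows that $g(A + p) \ge g(A) + g'(A + p)p$ and $g(A - q) \ge g(A) - g'(A - q)q$. Therefore the quantity (4) is bounded from below by

\begin{align*}
g'(A + p)p - g'(A - q)q
 &= \frac{2\delta^2 + \delta\gamma + \frac{\delta}{2}}{2\sqrt{\left(\frac{1}{4} + \delta\right)^2 - \frac{\gamma^2}{4}}} - \frac{-2\delta^2 - \delta\gamma + \frac{\delta}{2}}{2\sqrt{\left(\frac{1}{4} - \delta\right)^2 - \frac{\gamma^2}{4}}} \\
 &= \frac{\delta\left(2\delta + \gamma + \frac{1}{2}\right)\sqrt{\left(\frac{1}{4} - \delta\right)^2 - \frac{\gamma^2}{4}} - \delta\left(-2\delta - \gamma + \frac{1}{2}\right)\sqrt{\left(\frac{1}{4} + \delta\right)^2 - \frac{\gamma^2}{4}}}{2\sqrt{\left(\frac{1}{4} + \delta\right)^2 - \frac{\gamma^2}{4}}\,\sqrt{\left(\frac{1}{4} - \delta\right)^2 - \frac{\gamma^2}{4}}}.
\end{align*}

In order to show that $Z_2 - Z_1 \ge 0$, it is sufficient to prove that

\[ \left(2\delta + \gamma + \tfrac{1}{2}\right)^2\left(\left(\tfrac{1}{4} - \delta\right)^2 - \tfrac{\gamma^2}{4}\right) - \left(-2\delta - \gamma + \tfrac{1}{2}\right)^2\left(\left(\tfrac{1}{4} + \delta\right)^2 - \tfrac{\gamma^2}{4}\right) \]

is nonnegative. Let us denote the quantity above as $\Delta$. Then we get

\begin{align*}
\Delta &= \left((2\delta + \gamma)^2 + (2\delta + \gamma) + \tfrac{1}{4}\right)\left(\tfrac{1}{16} - \tfrac{\delta}{2} + \delta^2 - \tfrac{\gamma^2}{4}\right) - \left((2\delta + \gamma)^2 - (2\delta + \gamma) + \tfrac{1}{4}\right)\left(\tfrac{1}{16} + \tfrac{\delta}{2} + \delta^2 - \tfrac{\gamma^2}{4}\right) \\
 &= 2(2\delta + \gamma)\left(\tfrac{1}{16} + \delta^2 - \tfrac{\gamma^2}{4}\right) - 2\cdot\tfrac{\delta}{2}\left((2\delta + \gamma)^2 + \tfrac{1}{4}\right) \\
 &= \frac{\gamma}{2}\left(\frac{1}{4} - (2\delta + \gamma)^2\right) \ge 0,
\end{align*}

as desired.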
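A brute-force grid search gives a quick sanity check of Lemma B. The sketch assumes the objective $V(b, c) = \sqrt{b(1/2 - c)} + \sqrt{(1/2 - b)c}$ as it appears in the proof; the grid resolution and the function names are arbitrary illustrative choices.

```python
import math

def V(b, c):
    """Objective of Lemma B."""
    return math.sqrt(b * (0.5 - c)) + math.sqrt((0.5 - b) * c)

def claimed_maximizer(gamma, delta):
    """The point (b, c) at which Lemma B states the maximum is attained."""
    c = 0.25 - gamma / 2 - delta
    b = (0.5 - gamma) - c
    return b, c

def grid_max(gamma, delta, steps=200):
    """Largest value of V on a grid over the feasible region
    b, c >= 0, b + c <= 1/2 - gamma, c <= 1/4 - gamma/2 - delta."""
    beta_max = 0.5 - gamma
    c_max = 0.25 - gamma / 2 - delta
    best = 0.0
    for i in range(steps + 1):
        c = c_max * i / steps
        for j in range(steps + 1):
            b = (beta_max - c) * j / steps  # keeps b + c <= beta_max
            best = max(best, V(b, c))
    return best
```

For example, with $\gamma = 0.1$ and $\delta = 0.05$ the claimed maximizer is $(b, c) = (0.25, 0.15)$, and the grid maximum should agree with `V(0.25, 0.15)` up to rounding.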







Boosting Based on Divide and Merge 



Eiji Takimoto, Syuhei Koya, and Akira Maruoka 

Graduate School of Information Sciences 
Tohoku University, Sendai 980-8579, Japan. 

{t2, s_koya, maruoka}@ecei.tohoku.ac.jp



Abstract. InfoBoost is a boosting algorithm that improves the performance of the master hypothesis whenever each weak hypothesis brings non-zero mutual information about the target. We give a somewhat surprising observation that InfoBoost can be viewed as an algorithm for growing a branching program that divides and merges the domain repeatedly. We generalize the merging process and propose a new class of boosting algorithms called BP.InfoBoost with various merging schemes. BP.InfoBoost assigns to each node a weight as well as a weak hypothesis, and the master hypothesis is a threshold function of the sum of the weights over the path induced by a given instance. InfoBoost is a BP.InfoBoost with an extreme scheme that merges all nodes in each round. The other extreme, which merges no nodes, yields an algorithm for growing a decision tree. We call this particular version DT.InfoBoost. We give evidence that DT.InfoBoost improves the master hypothesis very efficiently, but it has a risk of overfitting because the size of the master hypothesis may grow exponentially. We propose a merging scheme between these extremes that improves the master hypothesis nearly as fast as the one without merging while keeping the branching program at a moderate size.



1 Introduction 

Since the stage-wise process of AdaBoost was explained in terms of a Newton-like method for minimizing an exponential cost function [2,5], designing and analyzing boosting algorithms by using results from optimization theory has been a mainstream of this research area [9,3,4,11]. In the most common scheme of boosting, the booster iteratively makes probability weightings over the sample and receives weak hypotheses that slightly correlate with the sample with respect to the current weightings. The booster then produces a linear combination of the weak hypotheses as a master hypothesis with a good performance guarantee expressed in terms of the margin [6,12]. Here, the correlation between a weak hypothesis and the sample is defined as the expected margin, which intuitively means how well the hypothesis classifies the sample.

As opposed to the margin-based criterion, there is an information-theoretic approach where the correlation is defined by the mutual information, i.e., how much information the hypotheses bring on the sample. In this scheme, the requirement for weak hypotheses can be relaxed to have non-zero mutual information, rather than non-zero margin, to obtain a good master hypothesis. The first theoretical result on information-based boosting is due to Kearns and Mansour [8], who give theoretical justification to empirically successful heuristics in top-down decision tree learning. The result is naturally generalized for learning multiclass classification problems [14] and is improved to avoid overfitting by merging some sets of nodes [10]. A modification of the latter algorithm is applied to the problem of boosting in the presence of noise [7]. These algorithms are quite different from AdaBoost-like algorithms in that they grow a decision tree or a branching program as a master hypothesis that is far from the form of linear combinations. Moreover, the probability weightings they make are not based on the principle of minimizing any cost function. Later we give evidence that these algorithms are inefficient in the sense that they make the complexity of the master hypothesis unnecessarily large. Aslam proposed a promising method of information-based boosting called InfoBoost [1]. InfoBoost is interesting because the weight update is obtained by minimizing the same cost function as AdaBoost and the master hypothesis has a "quasi-linear" form (the coefficients are not constants but depend on the values of weak hypotheses).

In this paper, we generalize InfoBoost and propose a new class of information-based boosting algorithms. We first give a somewhat surprising observation that InfoBoost can be viewed as a process of growing a branching program that divides and merges the domain repeatedly. We show that the merging process used in InfoBoost can be replaced by any merging scheme with the boosting property preserved. So we have a class of boosting algorithms with various merging schemes. We call any of them a BP.InfoBoost. BP.InfoBoost assigns to each node a weight as well as a weak hypothesis, and the master hypothesis is a threshold function of the sum of the weights over the path induced by a given instance. Note that this is different from the previous BP-based boosting algorithms [10,7], where the master hypothesis is just a branching program. InfoBoost is a BP.InfoBoost using an extreme scheme that merges all nodes in each round. The other extreme, which merges no nodes, yields an algorithm for growing a decision tree. We call this particular version DT.InfoBoost. DT.InfoBoost has a totally corrective update property. (The notion of totally corrective updates was originally proposed for a better booster in the margin-based criterion [9].) Specifically, a totally corrective update makes the weight $D_{t+1}$ for the next round so that all weak hypotheses $h_1, \ldots, h_t$ obtained so far are uncorrelated with the sample with respect to $D_{t+1}$. This implies that any weak hypothesis $h_{t+1}$ that correlates with the sample must contain novel information that $h_1, \ldots, h_t$ do not have. Note that AdaBoost and InfoBoost make $D_{t+1}$ so that only $h_t$ is uncorrelated. So we can expect that DT.InfoBoost improves the master hypothesis much faster than InfoBoost. On the other hand, since the size of the decision tree may grow exponentially in $t$, DT.InfoBoost has more risk of overfitting than InfoBoost^.



^ The decision tree produced by DT.InfoBoost uses the same weak hypothesis $h_t$ as the decision rule at all nodes of depth $t - 1$. So we can expect that DT.InfoBoost has less risk of overfitting than the classical top-down algorithms that produce trees with exponentially many different decision rules.

There must be an appropriate merging scheme between the two extremes that takes advantage of both. In this paper, we propose a merging scheme that reduces the training error of the master hypothesis nearly as fast as the one we would have without merging while keeping the master hypothesis (branching program) at a moderate size.

2 Preliminaries

Let $X$ denote an instance space and $Y = \{-1, +1\}$ the set of labels. We fix a sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq X \times Y$ on which we discuss the performance of boosting algorithms. The procedure of boosting is described by the following general protocol with two parties, the booster and the weak learner. On each round $t$, the booster makes a probability distribution $D_t$ on the index set $\{1, \ldots, m\}$ of the sample $S$ and gives $S$ with $D_t$ to the weak learner; the weak learner returns a weak hypothesis $h_t : X \to Y$ to the booster (denoted $h_t = \mathrm{WL}(S, D_t)$); the booster updates the distribution to $D_{t+1}$. Repeating the above procedure for $T$ rounds for some $T$, the booster combines the weak hypotheses $h_1, \ldots, h_T$ obtained so far and makes a master hypothesis.

In this paper we measure the performance of a weak hypothesis $h_t$ in terms of the information that $h_t$ brings. For the entropy function, we use $G(p) = 2\sqrt{p(1 - p)}$ defined for $p \in [0, 1]$ so that we can interpret the progress of boosting as the product of conditional entropies. This function is used to describe the performance of various boosting algorithms [1,8,10,14]. Note that since the function $G$ upper bounds Shannon entropy, small entropy in terms of $G$ implies small Shannon entropy. For a probability distribution $D$ over $\{1, \ldots, m\}$, we sometimes consider $X$ and $Y$ as random variables that take values $x_i$ and $y_i$ (resp.) with probability $D(i)$. In particular, we call $Y$ the target random variable. Let the probability that the target $Y$ takes value 1 under $D$ be denoted by

\[ p = \Pr_D(y_i = 1) = \sum_{i=1}^{m} D(i)\,[\![ y_i = 1 ]\!], \]

where $[\![ C ]\!]$ is the indicator function that is 1 if the condition $C$ holds and 0 otherwise. Then the entropy of the target $Y$ with respect to $D$ is defined as

\[ H_D(Y) = G(p) = 2\sqrt{p(1 - p)}. \]

Furthermore, for any function $g : X \to Z$ for some countable set $Z$, the conditional entropy of $Y$ given $g$ with respect to $D$ is defined as

\[ H_D(Y \mid g) = \sum_{z \in Z} \Pr_D(g(x_i) = z)\, H_D(Y \mid g(x_i) = z) = \sum_{z \in Z} \Pr_D(g(x_i) = z)\, G(p_z), \]

where $p_z = \Pr_D(y_i = 1 \mid g(x_i) = z)$. The entropies $H_D(Y)$ and $H_D(Y \mid g)$ are interpreted as the uncertainty of the target before and after seeing the value of $g$, respectively. So the mutual information $H_D(Y) - H_D(Y \mid g)$ represents the amount of information that $g$ brings. In particular, since the distribution $D_t$ we consider in this paper always satisfies $H_{D_t}(Y) = 1$, we measure the performance of the weak hypothesis $h_t$ by the conditional entropy $H_{D_t}(Y \mid h_t)$. The inequality $H_{D_t}(Y \mid h_t) < 1$ implies that the weak hypothesis $h_t$ brings non-zero information about the target.

Here we give some basic properties of the entropies. Let $D$ be a distribution over $\{1, \ldots, m\}$ and $g'$ be another function defined on $X$. Then, we have

\[ H_D(Y \mid g) = 2\sum_{z \in Z} \sqrt{\Pr_D(g(x_i) = z, y_i = 1)\,\Pr_D(g(x_i) = z, y_i = -1)}, \tag{1} \]

\[ H_D(Y \mid g, g') \le H_D(Y \mid g). \tag{2} \]
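The definitions above translate directly into a few lines of code. The following sketch assumes the sample is given as parallel lists (a distribution `D`, labels `ys` in {-1, +1}, and the values `gs` of $g$ on each instance); these names are illustrative. It computes $H_D(Y \mid g)$ both from the definition and from identity (1), so the two can be compared, and property (2) can be checked by conditioning on pairs of values.

```python
import math
from collections import defaultdict

def G(p):
    """Entropy-like function G(p) = 2*sqrt(p(1-p)) on [0, 1]."""
    return 2.0 * math.sqrt(max(p * (1.0 - p), 0.0))

def entropy(D, ys):
    """H_D(Y) = G(Pr_D(y_i = 1))."""
    return G(sum(d for d, y in zip(D, ys) if y == 1))

def _joint(D, ys, gs):
    """Maps z -> Pr_D(g = z, y = +1) and z -> Pr_D(g = z, y = -1)."""
    pos, neg = defaultdict(float), defaultdict(float)
    for d, y, z in zip(D, ys, gs):
        (pos if y == 1 else neg)[z] += d
    return pos, neg

def cond_entropy(D, ys, gs):
    """H_D(Y|g) from the definition: sum_z Pr_D(g = z) * G(p_z)."""
    pos, neg = _joint(D, ys, gs)
    total = 0.0
    for z in set(gs):
        mass = pos[z] + neg[z]
        if mass > 0:
            total += mass * G(pos[z] / mass)
    return total

def cond_entropy_joint(D, ys, gs):
    """H_D(Y|g) via identity (1)."""
    pos, neg = _joint(D, ys, gs)
    return 2.0 * sum(math.sqrt(pos[z] * neg[z]) for z in set(gs))
```

Conditioning on two functions at once, as in (2), only requires passing the zipped values, e.g. `cond_entropy(D, ys, list(zip(gs, gs2)))`.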

3 InfoBoost Growing a Branching Program

In this section, we show that InfoBoost can be viewed as a top-down algorithm for growing a branching program.

First we briefly review how InfoBoost works. In each round $t$, when given a weak hypothesis $h_t$, InfoBoost updates the distribution to

\[ D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t[h_t(x_i)]\,h_t(x_i)\,y_i\right)}{Z_t} \]

for appropriately chosen real numbers $\alpha_t[-1]$ and $\alpha_t[1]$, where $Z_t$ is for normalization. The two parameters $\alpha_t[-1]$ and $\alpha_t[1]$ are chosen so that the normalization factor $Z_t$ is minimized. Interestingly, for these choices of parameters we have $Z_t = H_{D_t}(Y \mid h_t)$. The master hypothesis that InfoBoost produces is given by $\mathrm{sign}(F_T(x))$, where

\[ F_T(x) = \sum_{t=1}^{T} \alpha_t[h_t(x)]\,h_t(x). \tag{3} \]

The details of the algorithm are given in Fig. 1. The performance of InfoBoost is summarized in the following inequality [1]:

\[ \Pr_U(\mathrm{sign}(F_T(x_i)) \ne y_i) \le \prod_{t=1}^{T} H_{D_t}(Y \mid h_t), \tag{4} \]

where $U$ is the uniform distribution over $\{1, \ldots, m\}$. This implies that the training error of the master hypothesis decreases rapidly as $T$ becomes large, as long as the weak hypotheses $h_t$ bring non-zero information with respect to $D_t$, i.e., $H_{D_t}(Y \mid h_t) < 1$. In particular, if all hypotheses $h_t$ satisfy $H_{D_t}(Y \mid h_t) \le 1 - \gamma$ for some $\gamma > 0$, then $\Pr_U(\mathrm{sign}(F_T(x_i)) \ne y_i) \le \epsilon$ with $T = (1/\gamma) \ln(1/\epsilon)$.

Now we interpret InfoBoost as a process of growing a branching program. At the beginning of round $t$ (except $t = 1$), we have the set $L_{t-1}$ of two leaves. When given $h_t$, each leaf $l \in L_{t-1}$ becomes an internal node with two child leaves

Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$

Initialize $D_1(i) = 1/m$;
for $t = 1$ to $T$ do
    $h_t = \mathrm{WL}(S, D_t)$;
    Choose real numbers $\alpha_t[-1]$ and $\alpha_t[1]$ so that for any $a \in \{-1, 1\}$,
        $\alpha_t[a] = \frac{a}{2} \ln \frac{\Pr_{D_t}(h_t(x_i) = a,\, y_i = 1)}{\Pr_{D_t}(h_t(x_i) = a,\, y_i = -1)}$;
    Update $D_t$ to $D_{t+1}$ so that
        $D_{t+1}(i) = D_t(i) \exp\left(-\alpha_t[h_t(x_i)]\,h_t(x_i)\,y_i\right) / Z_t$
    holds for $1 \le i \le m$, where $Z_t$ is for normalization;
Let $F_T : x \mapsto \sum_{t=1}^{T} \alpha_t[h_t(x)]\,h_t(x)$;
Output $\mathrm{sign}(F_T(x))$

Fig. 1. The algorithm of InfoBoost.
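A single round of the update in Fig. 1 can be sketched as follows. This is an illustration only: it assumes the weak hypothesis is already given as a list `hs` of its values on the sample, and that every combination of outcome and label has positive probability under the current distribution so that the logarithms are finite. The function name is hypothetical.

```python
import math

def infoboost_round(D, hs, ys):
    """One InfoBoost round.

    D  : current distribution D_t over the sample (list of floats)
    hs : values h_t(x_i) in {-1, +1}
    ys : labels y_i in {-1, +1}
    Returns (alpha, D_next, Z) with alpha[a] as in Fig. 1.
    Assumes all four joint probabilities Pr(h = a, y = b) are positive.
    """
    joint = {(a, b): 0.0 for a in (-1, 1) for b in (-1, 1)}
    for d, h, y in zip(D, hs, ys):
        joint[(h, y)] += d
    # alpha_t[a] = (a/2) * ln( Pr(h = a, y = 1) / Pr(h = a, y = -1) )
    alpha = {a: 0.5 * a * math.log(joint[(a, 1)] / joint[(a, -1)])
             for a in (-1, 1)}
    unnorm = [d * math.exp(-alpha[h] * h * y) for d, h, y in zip(D, hs, ys)]
    Z = sum(unnorm)
    return alpha, [u / Z for u in unnorm], Z
```

On a uniform sample of eight points where the hypothesis is right on three of four positives and three of four negatives, the returned $Z$ should equal $G(3/4) = \sqrt{3}/2$, and under the new distribution each outcome of $h_t$ should carry equal mass on both labels.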



$l_{-1}$ and $l_1$, just as in a decision tree growing algorithm. Note that the same hypothesis $h_t$ is used for the decision made at any $l \in L_{t-1}$, and the outcomes $-1$ and $1$ correspond to the edges $(l, l_{-1})$ and $(l, l_1)$, respectively. Here we have four leaves. Then for each $a \in \{-1, 1\}$, the set $\{l_a \mid l \in L_{t-1}\}$ of two leaves is merged into a single leaf, which is labeled with weight $a \cdot \alpha_t[a]$. The new leaves corresponding to $a = -1$ and $a = 1$ are called a $(-1)$-node and a $1$-node, respectively. Thus, at the end of round $t$, we have the set $L_t$ of two leaves again. So each round consists of two phases, dividing and merging the subset of the instance space. See Fig. 2.




Fig. 2. A branching program that InfoBoost grows.

By virtue of the branching program representation, we give natural interpretations of how the distribution $D_{t+1}$ and the combined hypothesis $F_T$ are determined. An instance $x \in X$ induces a path of the branching program based on the outcome sequence $h_1(x), \ldots, h_T(x)$ in the obvious way. Let $\ell_t : X \to L_t$ denote the function that maps an instance $x$ to the node in $L_t$ on the path that $x$ induces. Note that for the branching program generated by InfoBoost, $\ell_t(x)$ depends only on $h_t(x)$ and so $H_D(Y \mid \ell_t) = H_D(Y \mid h_t)$ for any $D$. Intuitively, the distribution $D_{t+1}$ is determined so that $\ell_t$ (and thus $h_t$) is uncorrelated with $Y$ and the total weight of leaf $l \in L_t$ is proportional to the uncertainty of $Y$ at the leaf $l$. More precisely (as we will see later for a general branching program), the distribution $D_{t+1}$ satisfies

\[ H_{D_{t+1}}(Y \mid \ell_t) = H_{D_{t+1}}(Y \mid h_t) = 1 \tag{5} \]

and, for any $l \in L_t$,

\[ \Pr_{D_{t+1}}(\ell_t(x_i) = l) = \frac{\Pr_{D_t}(\ell_t(x_i) = l)\, H_{D_t}(Y \mid \ell_t(x_i) = l)}{H_{D_t}(Y \mid \ell_t)}. \tag{6} \]

The property (5) is essential because otherwise the next hypothesis $h_{t+1}$ would be the same as $h_t$, for which $H_{D_{t+1}}(Y \mid h_{t+1}) < 1$ but no additional information is obtained. The property (6) is reasonable since it means that the leaf with larger entropy will be more focused on, with the hope that the next hypothesis $h_{t+1}$ would reduce the whole entropy efficiently.

The combined hypothesis $F_T$ given by (3) is also represented in terms of the branching program. For each node $l \in L_t$, let $w[l]$ denote the weight labeled at $l$. That is, $w[l] = -\alpha_t[-1]$ if $l$ is a $(-1)$-node and $w[l] = \alpha_t[1]$ if $l$ is a $1$-node. It is easy to see that for a given instance $x$, $F_T(x)$ is represented as the sum of the weights of the nodes on the path induced by $x$. That is,

\[ F_T(x) = \sum_{t=1}^{T} w[\ell_t(x)]. \tag{7} \]



4 BP.InfoBoost

In this section, we generalize the merging scheme used by InfoBoost and propose a class of boosting algorithms called BP.InfoBoost with various merging schemes. Curiously, we show that the bound (4) on the training error continues to hold for BP.InfoBoost with any merging scheme.

First we describe how BP.InfoBoost works. For convenience, we represent the set of leaves $L_t$ as a partition of $\{-1, 1\}^t$. That is, a leaf $l \in L_t$ is a subset of $\{-1, 1\}^t$. For a given partition $L_t$, the function $\ell_t : X \to L_t$ is defined as

\[ \ell_t(x) = l \iff (h_1(x), \ldots, h_t(x)) \in l. \]

At the beginning of round $t$, we have the set $L_{t-1}$ of leaves. Now $L_{t-1}$ may contain more than two leaves. When given $h_t$, each leaf $l \in L_{t-1}$ becomes an internal node with two child leaves $l_{-1}$ and $l_1$. Formally, the leaf $l_a$ with $a \in \{-1, 1\}$ is a subset of $\{-1, 1\}^t$ and is given by $l_a = \{(v, a) \mid v \in l\}$. So, an instance $x$ that reaches a leaf $l \in L_{t-1}$ (i.e., $\ell_{t-1}(x) = l$) goes to $l_a$ if $h_t(x) = a$. Here we have twice as many leaves as in $L_{t-1}$. Then for any $a \in \{-1, 1\}$ the set $\{l_a \mid l \in L_{t-1}\}$ is partitioned somehow into $L'_{a,1}, \ldots, L'_{a,k_a}$ for some $k_a$, and the leaves in each $L'_{a,j}$ are merged into a single leaf, which is formally given by $\bigcup_{l_a \in L'_{a,j}} l_a$. Thus we have the set $L_t$ of leaves of size $k_{-1} + k_1$. Fig. 3 illustrates how the branching program grows. A merging scheme is a procedure for deciding how the sets $\{l_a \mid l \in L_{t-1}\}$ are partitioned. For example, InfoBoost is a BP.InfoBoost with a particular merging scheme that lets each set $\{l_a \mid l \in L_{t-1}\}$ be a partition of itself (that is, $k_a = 1$ and all leaves $l_a$ are merged into one).




Fig. 3. A branching program that a BP.InfoBoost grows. Note that any leaf in $L_t$ is determined by a leaf in $L_{t-1}$ and the outcome of $h_t$.



Each leaf $l \in L_t$ is labeled with the weight given by

\[ w[l] = \frac{1}{2} \ln \frac{\Pr_{D_t}(\ell_t(x_i) = l,\, y_i = 1)}{\Pr_{D_t}(\ell_t(x_i) = l,\, y_i = -1)}. \tag{8} \]

With the notations $w$ and $\ell_t$, we can write the update rule and the master hypothesis in the same way as for the case of InfoBoost. That is, the distribution is updated to

\[ D_{t+1}(i) = \frac{D_t(i) \exp\left(-w[\ell_t(x_i)]\,y_i\right)}{Z_t}, \tag{9} \]

where $Z_t$ is for normalization, and the master hypothesis is $\mathrm{sign}(F_T(x))$, where $F_T$ is given by (7). The details of the algorithm are given in Fig. 4. Note that $\ell_t$ can be viewed as a domain-partitioning weak hypothesis and the weights $w[l]$ derived in (8) are identical to the confidences assigned by Schapire and Singer [13].

Now we show that a BP.InfoBoost with any merging scheme has the same form of upper bound on the training error. We need some technical lemmas. The first one is immediately obtained by recursively applying the update rule (9).

Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$

Initialize $D_1(i) = 1/m$, $L_0 = \{\epsilon\}$, $\ell_0 : x \mapsto \epsilon$;
for $t = 1$ to $T$ do
    $h_t = \mathrm{WL}(S, D_t)$;
    Let $l_a = \{(v, a) \mid v \in l\}$ for $l \in L_{t-1}$ and $a \in \{-1, 1\}$;
    for $a \in \{-1, 1\}$ do
        Partition $\{l_a \mid l \in L_{t-1}\}$ into $L'_{a,1}, \ldots, L'_{a,k_a}$;
    $L_t = \left\{ \bigcup_{l_a \in L'_{a,j}} l_a \;\middle|\; 1 \le j \le k_a,\ a \in \{-1, 1\} \right\}$;
    Let $\ell_t : x \mapsto l \in L_t$ where $(h_1(x), \ldots, h_t(x)) \in l$;
    for $l \in L_t$ do
        Let $w[l] = \frac{1}{2} \ln \frac{\Pr_{D_t}(\ell_t(x_i) = l,\, y_i = 1)}{\Pr_{D_t}(\ell_t(x_i) = l,\, y_i = -1)}$;
    Update $D_t$ to
        $D_{t+1}(i) = D_t(i) \exp\left(-w[\ell_t(x_i)]\,y_i\right) / Z_t$
    for $1 \le i \le m$, where $Z_t$ is for normalization;
Let $F_T : x \mapsto \sum_{t=1}^{T} w[\ell_t(x)]$;
Output $\mathrm{sign}(F_T(x))$

Fig. 4. The algorithm of a BP.InfoBoost.
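The procedure in Fig. 4 can be sketched in code with the merging scheme passed in as a function. Several things here are assumptions made for illustration and are not part of the paper: the weak learner is replaced by a search over a fixed pool of candidate hypotheses (each given as a list of its values on the sample), a small constant `EPS` smooths the weights of pure leaves so that the logarithm stays finite, and a merging scheme is encoded as a hypothetical function `merge(parent_leaf, a)` returning a group label; leaves with the same label and the same outcome $a$ are merged.

```python
import math
from collections import defaultdict

EPS = 1e-12  # smoothing for pure leaves; an assumption, not part of Fig. 4

def cond_entropy(D, ys, keys):
    """H_D(Y | key) computed via identity (1)."""
    pos, neg = defaultdict(float), defaultdict(float)
    for d, y, k in zip(D, ys, keys):
        (pos if y == 1 else neg)[k] += d
    return 2 * sum(math.sqrt(pos[k] * neg[k]) for k in set(keys))

def bp_infoboost(pool, ys, T, merge):
    """Run T rounds; returns (F, Zs): F_T on the sample and the factors Z_t."""
    m = len(ys)
    D = [1.0 / m] * m
    leaf = [()] * m          # l_0(x_i): every instance starts at the root
    F = [0.0] * m
    Zs = []
    for _ in range(T):
        # stand-in weak learner: candidate with the smallest H_D(Y | h)
        h = min(pool, key=lambda col: cond_entropy(D, ys, col))
        # divide by the outcome a, then merge; a stays part of the leaf id,
        # so only leaves with the same outcome can be merged
        leaf = [(merge(l, a), a) for l, a in zip(leaf, h)]
        pos, neg = defaultdict(float), defaultdict(float)
        for d, y, l in zip(D, ys, leaf):
            (pos if y == 1 else neg)[l] += d
        # weight (8), smoothed by EPS
        w = {l: 0.5 * math.log((pos[l] + EPS) / (neg[l] + EPS))
             for l in set(leaf)}
        # update rule (9)
        unnorm = [d * math.exp(-w[l] * y) for d, y, l in zip(D, ys, leaf)]
        Z = sum(unnorm)
        D = [u / Z for u in unnorm]
        F = [f + w[l] for f, l in zip(F, leaf)]
        Zs.append(Z)
    return F, Zs

def merge_all(parent, a):
    """InfoBoost: all leaves with the same outcome are merged."""
    return None

def merge_none(parent, a):
    """DT.InfoBoost: no merging, the parent leaf is remembered."""
    return parent

def training_error(F, ys):
    """Fraction of sample points with sign(F(x_i)) != y_i (sign(0) = -1)."""
    return sum(1 for f, y in zip(F, ys) if (1 if f > 0 else -1) != y) / len(ys)
```

With `merge_all` the leaf reached in round $t$ depends only on $h_t(x)$, as in InfoBoost, while `merge_none` keeps the whole outcome history and grows a decision tree. In both cases the training error should stay below the product of the returned factors $Z_t$, since that inequality only uses the form of the update rule.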



Lemma 1. Let $Z_t$ be the normalization factor in (9) and $D_{T+1}$ be the distribution obtained in the final round. Then,

\[ D_{T+1}(i) = \frac{D_1(i) \exp\left(-F_T(x_i)\,y_i\right)}{Z_1 \cdots Z_T}. \]

The next lemma shows that the normalization factor $Z_t$ can be rewritten as the conditional entropy.

Lemma 2. $Z_t = H_{D_t}(Y \mid \ell_t) \le H_{D_t}(Y \mid h_t)$.

Proof. First we show the equality part. By (8) we have

\begin{align*}
Z_t &= \sum_{i=1}^{m} D_t(i) \exp\left(-w[\ell_t(x_i)]\,y_i\right) \\
 &= \sum_{l \in L_t} \sum_{a \in \{-1, 1\}} \sum_{i=1}^{m} D_t(i)\,[\![ \ell_t(x_i) = l,\ y_i = a ]\!]\, \exp(-w[l]\,a) \\
 &= \sum_{l \in L_t} \sum_{a \in \{-1, 1\}} \Pr_{D_t}(\ell_t(x_i) = l,\ y_i = a) \left( \frac{\Pr_{D_t}(\ell_t(x_i) = l,\ y_i \ne a)}{\Pr_{D_t}(\ell_t(x_i) = l,\ y_i = a)} \right)^{1/2} \\
 &= 2 \sum_{l \in L_t} \sqrt{\Pr_{D_t}(\ell_t(x_i) = l,\ y_i = 1)\,\Pr_{D_t}(\ell_t(x_i) = l,\ y_i = -1)} \\
 &= H_{D_t}(Y \mid \ell_t).
\end{align*}

The last line is derived from (1).

For the inequality part, since the last component of any vector in the set $\ell_t(x)$ is $h_t(x)$, we have $H_{D_t}(Y \mid \ell_t) = H_{D_t}(Y \mid \ell_t, h_t)$, which is at most $H_{D_t}(Y \mid h_t)$ by (2). □

Now we are ready to give an upper bound on the training error.

Theorem 1. Let $F_T$ be the combined hypothesis of a BP.InfoBoost. Then

\[ \Pr_U(\mathrm{sign}(F_T(x_i)) \ne y_i) \le \prod_{t=1}^{T} H_{D_t}(Y \mid \ell_t) \le \prod_{t=1}^{T} H_{D_t}(Y \mid h_t). \]

Proof. By Lemma 2, it suffices to show that $\Pr_U(\mathrm{sign}(F_T(x_i)) \ne y_i) \le \prod_{t=1}^{T} Z_t$. Since the initial distribution $D_1$ is uniform, we have

\[ \Pr_U(\mathrm{sign}(F_T(x_i)) \ne y_i) = \sum_{i=1}^{m} D_1(i)\,[\![ \mathrm{sign}(F_T(x_i)) \ne y_i ]\!]. \]

Using $[\![ \mathrm{sign}(F_T(x_i)) \ne y_i ]\!] \le \exp(-F_T(x_i)\,y_i)$ and Lemma 1, we have

\begin{align*}
\Pr_U(\mathrm{sign}(F_T(x_i)) \ne y_i) &\le \sum_{i=1}^{m} D_1(i) \exp(-F_T(x_i)\,y_i) \\
 &= Z_1 \cdots Z_T \sum_{i=1}^{m} D_{T+1}(i) \\
 &= Z_1 \cdots Z_T.
\end{align*}

□

Theorem 1 means that the upper bound on the training error for a BP.InfoBoost is not larger than that for InfoBoost, provided that the same distributions $D_t$ and weak hypotheses $h_t$ are used. Moreover, when there is a large gap between $H_{D_t}(Y \mid \ell_t)$ and $H_{D_t}(Y \mid h_t)$, the BP.InfoBoost would improve the master hypothesis faster than InfoBoost.

Next we consider any possible construction of the master hypothesis. Since any master hypothesis $f : X \to \{-1, 1\}$ should be constructed from the weak hypotheses $h_1, \ldots, h_T$, it must hold that $H_U(Y \mid h_1, \ldots, h_T) \le H_U(Y \mid f)$ and so the l.h.s. gives an information-theoretic limitation of the performance of $f$. The next theorem shows how much the entropy $H_U(Y \mid h_1, \ldots, h_T)$ is reduced as $T$ grows. For an instance $x$, $\ell_{1..t}(x)$ is short for the path $(\ell_1(x), \ldots, \ell_t(x)) \in L_{1..t} = L_1 \times \cdots \times L_t$.

Theorem 2.

\[ H_U(Y \mid h_1, \ldots, h_T) = H_U(Y \mid \ell_{1..T}) = \left( \prod_{t=1}^{T} H_{D_t}(Y \mid \ell_t) \right) H_{D_{T+1}}(Y \mid \ell_{1..T}). \]

Proof. The first equality holds since there is a one-to-one correspondence between the set $L_{1..T}$ of paths of the branching program and the set of outcome sequences $(h_1(x), \ldots, h_T(x))$.

By (1) we have

\[ H_{D_{T+1}}(Y \mid \ell_{1..T}) = 2 \sum_{l \in L_{1..T}} \sqrt{\Pr_{D_{T+1}}(\ell_{1..T}(x_i) = l,\ y_i = 1)\,\Pr_{D_{T+1}}(\ell_{1..T}(x_i) = l,\ y_i = -1)}. \]

Lemma 1 says that the distribution $D_{T+1}(i)$ depends on the path $\ell_{1..T}(x_i)$ through which an instance $x_i$ goes. So for a particular path $l = (l_1, \ldots, l_T) \in L_{1..T}$ and for any $a \in \{-1, 1\}$, we have

\begin{align*}
\Pr_{D_{T+1}}(\ell_{1..T}(x_i) = l,\ y_i = a) &= \sum_{i=1}^{m} D_{T+1}(i)\,[\![ \ell_{1..T}(x_i) = l,\ y_i = a ]\!] \\
 &= \frac{\exp\left(-a \sum_{t=1}^{T} w[l_t]\right)}{Z_1 \cdots Z_T} \sum_{i=1}^{m} D_1(i)\,[\![ \ell_{1..T}(x_i) = l,\ y_i = a ]\!] \\
 &= \frac{\exp\left(-a \sum_{t=1}^{T} w[l_t]\right)}{Z_1 \cdots Z_T} \Pr_U(\ell_{1..T}(x_i) = l,\ y_i = a).
\end{align*}

By multiplying the above probabilities with $a = 1$ and $a = -1$, the exponential terms are canceled and we get

\[ H_{D_{T+1}}(Y \mid \ell_{1..T}) = \frac{H_U(Y \mid \ell_{1..T})}{Z_1 \cdots Z_T}, \]

which, together with Lemma 2, completes the proof of the theorem. □



Theorem 1 and Theorem 2 imply that our master hypothesis $\mathrm{sign}(F_T(x))$ would be nearly optimal if $H_{D_{T+1}}(Y \mid \ell_{1..T})$ is not too small. Although we do not know a lower bound on $H_{D_{T+1}}(Y \mid \ell_{1..T})$ in general, the distribution satisfies $H_{D_{T+1}}(Y \mid \ell_T) = 1$, as shown in (5) for the case of InfoBoost. That is, BP.InfoBoost updates the distribution so that $\ell_t$ is uncorrelated with $Y$. The next lemma shows that a BP.InfoBoost makes $D_{t+1}$ so that $\ell_t$ is uncorrelated with the sample. This implies that the weak hypothesis $h_{t+1}$ (and $\ell_{t+1}$ as well) with non-zero correlation must have some information that the function $\ell_t$ does not have. This gives an intuitive justification of why the improvement of the master hypothesis for a BP.InfoBoost is upper bounded by the product of $H_{D_t}(Y \mid \ell_t)$.

Lemma 3. For any $1 \le t \le T$, a BP.InfoBoost updates the distribution to $D_{t+1}$ so that $H_{D_{t+1}}(Y|\ell_t) = 1$ holds.

Proof. It suffices to show that for any leaf $l \in L_t$, $\Pr_{D_{t+1}}(y_i = 1 \mid \ell_t(x_i) = l) = 1/2$, or equivalently

$$\Pr_{D_{t+1}}(\ell_t(x_i) = l, y_i = 1) = \Pr_{D_{t+1}}(\ell_t(x_i) = l, y_i = -1). \qquad (10)$$

By (9) and (8), it is straightforward to see that both sides of (10) are equal to

$$\frac{\sqrt{\Pr_{D_t}(\ell_t(x_i) = l, y_i = 1)\, \Pr_{D_t}(\ell_t(x_i) = l, y_i = -1)}}{Z_t}.$$

□ 

Note that [7] also suggests making $D_{t+1}$ uncorrelated with $\ell_t$, but in a trivial way: $D_{t+1}$ is restricted to a single leaf $l$ and set so that $\Pr_{D_{t+1}}(y_i = 1 \mid \ell_t(x_i) = l) = 1/2$. This means that the weak learner must be invoked for all leaves in $L_t$ with different distributions. On the other hand, our distribution (9) assigns positive weights to all nodes $l$ unless $H_{D_t}(Y \mid \ell_t = l) = 0$, and the weak learner is invoked only once for each depth $t$.

5 DT.InfoBoost 

As we stated, InfoBoost is a BP.InfoBoost with a particular merging scheme that merges all nodes $l_1$ into one leaf and all nodes $l_{-1}$ into the other leaf. Since $H_{D_t}(Y|\ell_t) = H_{D_t}(Y|h_t)$, InfoBoost takes no advantage of the branching program representation.

Next we consider the BP.InfoBoost with no merging phase, i.e., DT.InfoBoost. In this case, since a path of the decision tree is uniquely determined by the leaf of the path, we have $H_{D_t}(Y|\ell_t) = H_{D_t}(Y|h_1, \ldots, h_t)$. Since $H_{D_t}(Y|h_1, \ldots, h_t)$ is likely to be much smaller than $H_{D_t}(Y|h_t)$, DT.InfoBoost would improve the master hypothesis more efficiently than InfoBoost. Lemma 3 implies

$$H_{D_{t+1}}(Y|h_1, \ldots, h_t) = H_{D_{t+1}}(Y|\ell_t) = 1.$$

So, DT.InfoBoost has a totally corrective update. This means that all the hypotheses $h_1, \ldots, h_t$ obtained so far are uncorrelated with $Y$. So the next hypothesis $h_{t+1}$ will bring truly new information that $h_1, \ldots, h_t$ do not have, provided $H_{D_{t+1}}(Y|h_{t+1}) < 1$. This contrasts with the case of InfoBoost, where $h_{t+1}$ is only guaranteed to bring information that $h_t$ does not have. Moreover, with the fact that $H_{D_{T+1}}(Y|\ell_T) = 1$, Theorem 1 and Theorem 2 give the performance of DT.InfoBoost nicely as

$$\Pr_U\left(\mathrm{sign}(F_T(x_i)) \ne y_i\right) \le H_U(Y|h_1, \ldots, h_T) = \prod_{t=1}^{T} H_{D_t}(Y|\ell_t).$$

This implies that DT.InfoBoost extracts information very effectively from the weak learner. Actually, we observe in experiments that DT.InfoBoost reduces the generalization error as well as the training error very rapidly in early rounds. Unfortunately, however, the generalization error sometimes starts to increase in later rounds. This is because the complexity of the master hypothesis grows exponentially in the number of rounds.
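The comparison made in this section between conditioning on the last weak hypothesis only and conditioning on the whole path can be checked numerically. The following is a minimal sketch, assuming the entropy is the quantity $H_D(Y|f) = 2\sum_z \sqrt{\Pr_D(f = z, y = 1)\Pr_D(f = z, y = -1)}$ used in the proofs; the function name `pseudo_entropy` and the toy sample are illustrative and not taken from the paper.

```python
import math
from collections import defaultdict

def pseudo_entropy(weights, labels, cells):
    """H_D(Y|f) = 2 * sum_z sqrt(Pr(f=z, y=+1) * Pr(f=z, y=-1)).

    weights: distribution D over the sample (non-negative, summing to 1)
    labels:  y_i in {+1, -1}
    cells:   f(x_i), any hashable value (a leaf, a path, a prediction)
    """
    pos = defaultdict(float)
    neg = defaultdict(float)
    for w, y, z in zip(weights, labels, cells):
        if y == 1:
            pos[z] += w
        else:
            neg[z] += w
    all_cells = set(pos) | set(neg)
    return 2.0 * sum(math.sqrt(pos[z] * neg[z]) for z in all_cells)

# Toy sample (hypothetical): four instances under the uniform distribution.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
h1 = [1, -1, 1, -1]        # carries no information about the label
h2 = [1, 1, -1, -1]        # agrees with the label
paths = list(zip(h1, h2))  # decision-tree leaf = full outcome sequence

entropy_h1 = pseudo_entropy(weights, labels, h1)
entropy_paths = pseudo_entropy(weights, labels, paths)
```

Conditioning on the path refines the partition induced by any single hypothesis, so the value for `paths` can never exceed the value for `h1` or `h2`.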

Generally, there is a tradeoff between the size of $L_t$ and $H_{D_t}(Y|\ell_t)$. More precisely, the larger the size of $L_t$ is (i.e., the fewer nodes we merge), the smaller $H_{D_t}(Y|\ell_t)$ becomes (i.e., the more rapidly the training error decreases). In the next section, we propose a merging scheme that makes $H_{D_t}(Y|\ell_t)$ nearly as small as the one we would have without merging, while keeping $L_t$ at a moderate size.



6 A Merging Scheme 

Assume we are given a weak hypothesis $h_t$ in the $t$-th round and each node $l \in L_{t-1}$ grows two child leaves $l_1$ and $l_{-1}$. Now we have twice as many leaves as $L_{t-1}$. Let the set of leaves be denoted by

$$\tilde{L}_t = \{l_1 \mid l \in L_{t-1}\} \cup \{l_{-1} \mid l \in L_{t-1}\}$$

and let $\tilde{\ell}_t : X \to \tilde{L}_t$ be defined analogously, that is, $\tilde{\ell}_t(x)$ is the leaf in $\tilde{L}_t$ that instance $x$ reaches. If we merge no nodes in $\tilde{L}_t$, then we would have $L_t = \tilde{L}_t$ and $\ell_t = \tilde{\ell}_t$. So, $H_{D_t}(Y|\tilde{\ell}_t)$ gives a lower bound on $H_{D_t}(Y|\ell_t)$ for $\ell_t$ induced by any merging for $\tilde{L}_t$. Let $H_{D_t}(Y|\tilde{\ell}_t) = 1 - \gamma_t$. The merging scheme we give guarantees that $H_{D_t}(Y|\ell_t) \le 1 - c\gamma_t$ for some constant $0 < c < 1$. For this purpose, we could use the merging method developed by Mansour and McAllester [10].

Lemma 4 ([10]). For any function $f : X \to Z$, any distribution $D$ over $X \times Y$, and for any $0 < \lambda, \delta < 1$, there exists a function $g : Z \to M$ such that

$$H_D(Y|g \circ f) \le (1 + \lambda) H_D(Y|f) + \delta \quad \text{and} \quad |M| = O((1/\lambda) \log(1/\delta)).$$

Clearly, letting $\tilde{\ell}_t$, $\tilde{L}_t$ and $L_t$ correspond to $f$, $Z$ and $M$, respectively, we have a merging scheme induced by $g$ so that $g \circ f = \ell_t$. So by choosing $\lambda = (1 - c)\gamma_t / (2(1 - \gamma_t))$ and $\delta = (1 - c)\gamma_t / 2$, we have $H_{D_t}(Y|\ell_t) \le 1 - c\gamma_t$. Unfortunately, however, the size of $L_t$ is too big, i.e., $|L_t| = O((1/\gamma_t) \ln(1/\gamma_t))$. In the following, we give a new merging method that guarantees the same bound on $H_{D_t}(Y|\ell_t)$ with significantly smaller $L_t$, i.e., $|L_t| = O(\ln(1/\gamma_t))$.

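Theorem 3. Let $f : X \to Z$ with $H_D(Y|f) = 1 - \gamma$ for some $0 < \gamma < 1$. Then, for any $0 < c < 1$, there exists a function $g : Z \to M$ such that

$$H_D(Y|g \circ f) \le 1 - c\gamma \quad \text{and} \quad |M| = O\left(\frac{\log(1/\gamma)}{\log(1 + 1/c)}\right).$$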

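Proof. For each $z \in Z$, let

$$p_z = \Pr_D(f(x_i) = z) \quad \text{and} \quad q_z = \Pr_D(y_i = 1 \mid f(x_i) = z).$$

Note that

$$H_D(Y|f) = \sum_{z \in Z} p_z G(q_z) = 1 - \gamma, \qquad (11)$$

where $G(q) = 2\sqrt{q(1 - q)}$ is our entropy function. Let $a = 2c/(1 - c)$, $\epsilon_0 = 0$ and for each $j \ge 1$, let

$$\epsilon_j = \left(\frac{1 + a}{a}\right)^{j-1} \frac{c\gamma}{a}.$$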
Let $k$ be the smallest integer such that $\epsilon_k > 1$. Now we define the function $g$. For each $1 \le j \le k$, let

$$S_{-j} = \{z \in Z \mid q_z < 1/2,\ \epsilon_{j-1} \le 1 - G(q_z) < \epsilon_j\}$$

and

$$S_j = \{z \in Z \mid q_z \ge 1/2,\ \epsilon_{j-1} \le 1 - G(q_z) < \epsilon_j\}.$$

That is, $Z$ is partitioned into $S_{-k} \cup \cdots \cup S_k$. Let $M = \{j, -j \mid 1 \le j \le k\}$ and for any $z \in Z$, let $g(z) = j$ such that $z \in S_j$. It is easy to see that $|M| = O\left(\frac{\log(1/\gamma)}{\log(1 + 1/c)}\right)$. So it suffices to show that $H_D(Y|g \circ f) \le 1 - c\gamma$.

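For each $j$, let

$$p(S_j) = \sum_{z \in S_j} p_z \quad \text{and} \quad q_j = \sum_{z \in S_j} p'_z q_z,$$

where $p'_z = p_z / p(S_j)$ with $z \in S_j$. Then,

$$H_D(Y|g \circ f) = \sum_{j \in M} p(S_j) G(q_j).$$

Since $G(q_j) \le 1 - \epsilon_{j-1}$, we have

$$\sum_{j \in M} p(S_j) G(q_j) \le 1 - \sum_{j \in M} p(S_j) \epsilon_{j-1}.$$

If $\sum_{j \in M} p(S_j) \epsilon_{j-1} \ge c\gamma$, then we are done. So in what follows, we assume

$$\sum_{j \in M} p(S_j) \epsilon_{j-1} < c\gamma. \qquad (12)$$

Since both $G(q_j)$ and $\sum_{z \in S_j} p'_z G(q_z)$ are in the range $(1 - \epsilon_j, 1 - \epsilon_{j-1}]$, we have

$$G(q_j) \le \sum_{z \in S_j} p'_z G(q_z) + \epsilon_j - \epsilon_{j-1}$$

and this with (11) gives

$$H_D(Y|g \circ f) \le 1 - \gamma + \sum_{j \in M} p(S_j)(\epsilon_j - \epsilon_{j-1}). \qquad (13)$$

By our choices of $\epsilon_j$,

$$\epsilon_{j-1} = a(\epsilon_j - \epsilon_{j-1})$$

holds for $j \ge 2$. Plugging this into (13) and using (12), we get

$$H_D(Y|g \circ f) \le 1 - \gamma + \frac{c\gamma}{a} + \epsilon_1 (p(S_{-1}) + p(S_1)) \le 1 - c\gamma.$$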

□ 
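The construction in this proof can be tried out on a toy partition. The sketch below is illustrative only: it assumes the thresholds $\epsilon_j = ((1+a)/a)^{j-1} c\gamma/a$ with $a = 2c/(1-c)$, which is the choice forced by the relation $\epsilon_{j-1} = a(\epsilon_j - \epsilon_{j-1})$ and the final inequality of the proof, and it uses hypothetical helper names (`merge_labels`, `merged_entropy`). Cells are grouped by which side of 1/2 their $q_z$ lies on and by the bucket $[\epsilon_{j-1}, \epsilon_j)$ containing $1 - G(q_z)$.

```python
import math

def G(q):
    """Entropy function G(q) = 2*sqrt(q(1-q)); clamped against rounding."""
    return 2.0 * math.sqrt(max(0.0, q * (1.0 - q)))

def cond_entropy(p, q):
    """H_D(Y|f) = sum_z p_z * G(q_z)."""
    return sum(pz * G(qz) for pz, qz in zip(p, q))

def merge_labels(p, q, c):
    """Return g(z) in {+-1, ..., +-k} for each cell z (assumed thresholds)."""
    gamma = 1.0 - cond_entropy(p, q)
    if gamma <= 0.0:  # degenerate case: every q_z equals 1/2
        return [1 if qz >= 0.5 else -1 for qz in q]
    a = 2.0 * c / (1.0 - c)
    ratio = (1.0 + a) / a
    labels = []
    for qz in q:
        d = 1.0 - G(qz)
        j, eps = 1, c * gamma / a   # eps_1 = c*gamma/a
        while d >= eps:             # find j with eps_{j-1} <= d < eps_j
            eps *= ratio
            j += 1
        labels.append(j if qz >= 0.5 else -j)
    return labels

def merged_entropy(p, q, labels):
    """H_D(Y|g o f) after merging cells that share a label."""
    mass, pos = {}, {}
    for pz, qz, j in zip(p, q, labels):
        mass[j] = mass.get(j, 0.0) + pz
        pos[j] = pos.get(j, 0.0) + pz * qz
    return sum(m * G(pos[j] / m) for j, m in mass.items() if m > 0.0)
```

With cell probabilities `p`, conditional label probabilities `q` and a parameter `c` in (0, 1), `merged_entropy(p, q, merge_labels(p, q, c))` should stay at or below `1 - c * gamma`, where `gamma = 1 - cond_entropy(p, q)`.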
Again, letting $\tilde{\ell}_t$ and $\ell_t$ correspond to $f$ and $g \circ f$, respectively, we have a merging scheme as desired. Intuitively, the parameter $c$ controls the tradeoff between $H_{D_t}(Y|\ell_t)$ and $|L_t|$. That is, if we let $c$ close to 1, then $H_{D_t}(Y|\ell_t)$ is as small as $H_{D_t}(Y|\tilde{\ell}_t)$ while $|L_t|$ is unbounded. This implies that the BP.InfoBoost behaves like DT.InfoBoost. On the other hand, if we let $c$ close to 0, then $H_{D_t}(Y|\ell_t)$ is unbounded (naturally bounded by $H_{D_t}(Y|h_t)$) while $|L_t|$ is as small as a constant. Thus, the BP.InfoBoost behaves like InfoBoost. So, we can expect that BP.InfoBoost with appropriate choice of parameter $c$ outperforms both InfoBoost and DT.InfoBoost.

Finally, we remark that we develop the merging method in a different philosophy from the one of Mansour and McAllester [10] in the following two points.

- Our merging scheme is based on the current distribution $D_t$ while that of [10] is based on the uniform distribution over the sample.

- Our merging scheme aims to make the function $\ell_t$ perform not much worse than $\tilde{\ell}_t$ as a weak hypothesis. On the other hand, [10] aims to make $\ell_t$ perform better as a master hypothesis, so that $H_U(Y|\ell_t)$ converges to 0. So their method needs many more leaves and the complexity of the branching program becomes unnecessarily large.



7 Remarks and Future Works 

1. Our analysis suggests that the weak learner should produce $h_t$ so that $H_{D_t}(Y|\ell_t)$, rather than $H_{D_t}(Y|h_t)$, is as small as possible.

2. Our algorithm uses the same cost function as AdaBoost and InfoBoost. This seems to nicely mesh with our entropy function $G(q) = 2\sqrt{q(1 - q)}$. What is the relation between the choices of cost functions and corresponding entropy functions?

3. The combined hypothesis $F_T(x) = \sum_{t=1}^{T} w[\ell_t(x)]$ can be rewritten as the dot product $F_T(x) = W \cdot H(x)$ where $W$ and $H(x)$ are vectors indexed by paths $\sigma \in L_1 \times \cdots \times L_T$ and its $\sigma = (\sigma_1, \ldots, \sigma_T)$th components are $W_\sigma = \sum_{t=1}^{T} w[\sigma_t]$ and $H_\sigma(x) = \prod_{t=1}^{T} [\ell_t(x) = \sigma_t]$, respectively. This defines feature maps from the node space to the path space. Can we analyze BP.InfoBoost in terms of the path space?

4. We need to give a criterion of choosing the tradeoff parameter c (that may 
depend on round t). 

5. We need to analyze the complexity of the master hypothesis, say, in terms 
of VC dimension. 

6. We are evaluating the performance of BP.InfoBoost on data sets from the 
UCI Machine Learning Repository. Preliminary experiments show that the 
BP.InfoBoost with the merging scheme developed in Section 6 performs well 
as compared to InfoBoost and DT.InfoBoost. 



Acknowledgments. The authors are grateful to Tatsuya Watanabe for showing 
Abstract. Linial, Mansour and Nisan introduced a learning algorithm for Boolean functions in $AC^0$ under the uniform distribution. Following their work, Bshouty, Jackson and Tamon, and also Ohtsuki and Tomita, extended the algorithm to one that is robust to noisy data. The noise process we are concerned with here is as follows: for an example $(x, f(x))$ ($x \in \{0, 1\}^n$, $f(x) \in \{-1, 1\}$), each bit of the vector $x$ and the label $f(x)$ is flipped independently with probability $\eta$ during the learning process. Before the present paper, we assumed that the correct noise rate $\eta$ or an upper bound on it is known in advance. In this paper, we eliminate this assumption. We estimate an upper bound on the noise rate by evaluating the noisy power spectrum on the frequency domain using a sampling trick. By combining this procedure with Ohtsuki et al.'s algorithm, we obtain a quasipolynomial time learning algorithm that can cope with noise without knowing any information about the noise in advance.



1 Introduction 

Linial et al. [5] proposed a learning algorithm for Boolean functions in $AC^0$ under the uniform distribution. Following their work, for the case where examples suffer from classification and attribute noise with noise rate $\eta < 1/2$, Bshouty et al. [1] showed a learning algorithm that uses the exact value of $\eta$. Ohtsuki et al. [7], [8] presented an algorithm which requires only an upper bound $\eta_0$ on $\eta$. Bshouty et al. [2] also showed how to eliminate their previous assumption, provided that the upper bound of the noise rate is known.

The noise rate $\eta$ certainly lies in $0 < \eta < 1/2$. Thus it could be possible to get rid of such a requirement for an upper bound $\eta_0$ on the noise rate $\eta$.

Until now a lot of learning problems under noise have been investigated. Laird [4] showed that we can estimate such an upper bound in the presence of classification noise. Then what happens in the case where the noise process contains not only classification errors but also attribute errors in an example? For estimating the noise rate $\eta$ and biased Fourier coefficients, Bshouty et al. [2] needed some sort of noise oracles on the assumption that a learner knows the number of necessary examples in advance. On the other hand, Ohtsuki et al. [7], [8] showed that a learner can estimate the noise rate $\eta$ from only examples containing noise.

In this paper, we eliminate the assumption that an upper bound $\eta_0$ on the noise rate is known. We propose an estimating method that compares predicted and experimental values on the frequency domain. Note that we can apply this estimating method to other classes in which the power spectrum of the Boolean function is concentrated enough in low degrees. By combining this procedure with Ohtsuki et al.'s learning algorithm, we can obtain an algorithm which assumes only $\eta < 1/2$ with respect to the unknown noise rate $\eta$. If we do not need to know the exact $\eta$ or an upper bound $\eta_0$ on $\eta$ in advance, it is more reasonable that we only require that examples may be changed with an unknown probability $\eta$. A preliminary version of this paper appeared in [6].

2 Preliminaries 

A Boolean function $f$ on $n$ variables is a mapping $f : \{0, 1\}^n \to \{-1, 1\}$. For $x \in \{0, 1\}^n$, let $|x|$ be the Hamming weight of $x$, i.e. the number of 1s contained in $x$, and $x[i]$ be the $i$-th bit of $x$. The bitwise exclusive-or of two $n$-bit vectors $x^1$ and $x^2$ is denoted $x^1 \oplus x^2$. In this paper, we have changed the range from $\{0, 1\}$ to $\{-1, 1\}$.

Let $F_n$ be a concept class over $\{0, 1\}^n$ and let $f, h \in F_n$. We consider the following learning model. A learning algorithm receives examples selected uniformly at random, and then outputs hypothesis $h$ and halts. An example $(x, f(x))$ is a pair of input $x \in \{0, 1\}^n$ and output $f(x) \in \{1, -1\}$. $\Pr[f(x) \ne h(x)]$ is the probability that $h$ differs from the target $f$ under the uniform distribution and we denote $\Pr[f(x) \ne h(x)] = error(h)$. The success of the learning is evaluated by two parameters, error parameter $\varepsilon$ and confidence parameter $\delta$. For any $f \in F_n$, $0 < \varepsilon < 1$, $0 < \delta < 1$, a learning algorithm $L$ outputs a hypothesis $h$ within $\varepsilon$ errors with probability at least $1 - \delta$.

Also, the hypothesis $h$ is called $\varepsilon$-good if and only if $error(h) \le \varepsilon$, otherwise $\varepsilon$-bad. A concept class is said to be learnable if and only if there exists some learning algorithm for $F_n$.

3 A Learning Algorithm via the Fourier Transform

Let $f$ be a real valued function from $\{0, 1\}^n$ to $[-1, 1]$. For $\alpha \in \{0, 1\}^n$, define the orthonormal basis $\chi_\alpha(x) = (-1)^{\alpha \cdot x}$ ($\alpha \cdot x = \sum_i \alpha[i] x[i]$) over $\{0, 1\}^n$. Then by using Fourier methods, any Boolean function $f$ can be represented by $f(x) = \sum_\alpha \hat{f}(\alpha) \chi_\alpha(x)$, where $\hat{f}(\alpha)$ is the Fourier coefficient of $f$ at $\alpha$. We call $|\alpha|$ the frequency of $\alpha$.
3.1 Learning Constant Depth Circuits [5]

Let $f$ be a Boolean function that can be computed by a constant depth, polynomial size circuit ($AC^0$ circuit). The learning algorithm by Linial et al. [5] is based on the following ideas. First, let a hypothesis be of the form $h(x) = \mathrm{sign}\left(\sum_\alpha \hat{h}(\alpha) \chi_\alpha(x)\right)$ over $\{0, 1\}^n$ for $\alpha \in \{0, 1\}^n$; if we can approximate $\hat{f}(\alpha)$ well, then the hypothesis $h$ can be an $\varepsilon$-good hypothesis. Secondly, we can compose $h$ by estimating only $O(n^k)$ coefficients for $k = O(\log^d(n/\varepsilon))$ since the Fourier coefficients of $AC^0$ are concentrated enough in low degrees ($\sum_\alpha \hat{f}(\alpha)^2 \le \varepsilon/2$ for any $\alpha$ s.th. $|\alpha| > k$). Thirdly, since $\hat{f}(\alpha) = E[f(x)\chi_\alpha(x)]$, we estimate $c_\alpha = \left(\sum_{j=1}^{m} f(x^j)\chi_\alpha(x^j)\right)/m$ from $m$ examples $\{(x^j, f(x^j))\}_{j=1}^{m}$, where $c_\alpha$ is an estimated value of $\hat{f}(\alpha)$ and used as $\hat{h}(\alpha)$. Also, from Chernoff bounds [3], we need $m = (4n^k/\varepsilon) \ln(2n^k/\delta)$ examples to guarantee that

$$\Pr\left[\, |\hat{f}(\alpha) - c_\alpha| \le \sqrt{\frac{\varepsilon}{2n^k}} \ \text{ for each } \alpha \text{ s.th. } |\alpha| \le k \,\right] \ge 1 - \delta.$$
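The three ideas above can be put together in a few lines. This is a minimal sketch of the low-degree estimation step, not the authors' code: each index $\alpha$ is represented by the tuple of bit positions where $\alpha[i] = 1$, and the names `low_degree_learn` and `hypothesis` are hypothetical.

```python
from itertools import combinations, product

def chi(support, x):
    """chi_alpha(x) = (-1)^(alpha . x), with alpha given by its support."""
    return -1 if sum(x[i] for i in support) % 2 else 1

def low_degree_learn(examples, n, k):
    """Estimate c_alpha = (1/m) * sum_j f(x^j) * chi_alpha(x^j) for |alpha| <= k."""
    m = len(examples)
    coeffs = {}
    for degree in range(k + 1):
        for support in combinations(range(n), degree):
            coeffs[support] = sum(y * chi(support, x) for x, y in examples) / m
    return coeffs

def hypothesis(coeffs, x):
    """h(x) = sign(sum_alpha c_alpha * chi_alpha(x)); ties are broken towards +1."""
    total = sum(c * chi(support, x) for support, c in coeffs.items())
    return 1 if total >= 0 else -1

# Usage on a toy target (majority of three bits), using every point of
# {0,1}^3 as the sample so that the estimates are the exact coefficients.
def majority(x):
    return 1 if sum(x) >= 2 else -1

sample = [(x, majority(x)) for x in product((0, 1), repeat=3)]
coeffs = low_degree_learn(sample, n=3, k=3)
```

In a real run the sample would be drawn at random and `k` would be chosen much smaller than `n`, so only $O(n^k)$ coefficients are estimated.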



4 Noise Rate and Its Upper Bound 

In the literature [7], [8], a part of [1], [2], and here, we are concerned with the combined attribute and classification noise model. A learning algorithm makes a call to oracle EX() and receives an example $(x, f(x))$ from EX() under the uniform distribution. However, the noise process changes an example $(x, f(x))$ to a noisy example $(x \oplus x_N, f(x) y_N)$ ($x_N \in \{0, 1\}^n$, $y_N \in \{1, -1\}$) with probability

$$\prod_{x_N[i]=1} \eta \prod_{x_N[i]=0} (1 - \eta) \prod_{y_N=-1} \eta \prod_{y_N=1} (1 - \eta),$$

where we call $\eta$ the noise rate with $0 < \eta < 1/2$.
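The noise process just described can be simulated directly. The following sketch assumes only what the text states (each bit of $x$ and the label are flipped independently with probability $\eta$); the function name `noisy_example` is illustrative.

```python
import random

def noisy_example(x, y, eta, rng):
    """Return (x XOR x_N, y * y_N): every bit and the label flip with prob. eta."""
    noisy_x = tuple(bit ^ 1 if rng.random() < eta else bit for bit in x)
    noisy_y = -y if rng.random() < eta else y
    return noisy_x, noisy_y

rng = random.Random(1)
clean = noisy_example((0, 1, 1), 1, 0.0, rng)    # eta = 0: nothing changes
flipped = noisy_example((0, 1, 1), 1, 1.0, rng)  # eta = 1: everything flips
```

The value `eta = 1.0` lies outside the range $0 < \eta < 1/2$ considered in the paper and is used here only to exercise the flipping mechanics.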

Laird [4] showed that in some learning problems from noisy examples, the sample size increases inversely proportional to $1 - 2\eta$, and thus we need to know the noise rate $\eta$ exactly or an upper bound $\eta_0$ on $\eta$ to determine a proper sample size in advance.

An upper bound $\eta_0$ is a "looser" estimate of the unknown noise rate $\eta$. Could we not estimate such an upper bound by observing only a noisy sample?

Laird also showed that we can estimate such an upper bound under the classification noise model. The classification noise process is that for an example $(x, f(x))$, only $f(x)$ is changed with probability $\eta$. The process of estimating such an upper bound makes a learning algorithm try successively larger values of $\eta_0$. If one of these successively larger values is close enough to the real noise rate $\eta$, the best hypothesis among the outputs of the learning algorithm should be $\varepsilon$-good under the classification noise model.

In the case where examples suffer from not only classification errors but also attribute errors, sometimes unacceptable examples are generated. There are examples that seem to be generated from the target function on the one hand, but at the same time there is a considerable number of examples that negate the target function. Under the classification noise model we can ignore such examples by requesting a proper amount of examples, even if the noise rate $\eta$ is close to 1/2. Under the combined classification and attribute noise model, however, it may happen that the number of examples that negate the target function exceeds the number of examples that correspond to the target function, depending on the target function and the noise rate. Because of this, we need to restrict target functions and the noise rate to acquire a hypothesis with the desired accuracy; otherwise we cannot distinguish whether an example from the noisy oracle originally belongs to the correct target function or not. Hence, it is difficult to use an error check on such unacceptable examples, since the hypothesis that best accounts for the noisy sample is not necessarily the desired hypothesis. For further details, see Chapter 5: Identification from Noisy Examples in Laird [4].

Instead of using the above estimating process, we develop a procedure that estimates an upper bound of $\eta$ by using a sampling trick on the frequency domain and combine it with the robust k-lowdegree algorithm [7], [8], shown in 4.1, that learns under noise. As a result, we can obtain a learning algorithm that performs well without any noise information in advance.



4.1 An Algorithm with an Upper Bound $\eta_0$ as an Input

We show the robust k-lowdegree algorithm [7], [8] (rkl) that works under the condition that an upper noise bound $\eta_0$ is given to a learner. The rkl algorithm searches for a number $t$ by increasing it by $\Delta t$ step by step.

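Robust k-Lowdegree Algorithm [7], [8] ($\varepsilon$, $\delta$, $\eta_0$):

Define $m = 1568\, n^{k+2} \left(\frac{1}{1 - 2\eta_0}\right)^{2k+4} \frac{1}{\varepsilon} \ln\frac{2n^k}{\delta}$,

$\gamma_u = \ln(1 + \sqrt{\varepsilon/2})/(k + 1)$,

$\Delta t = \ln\left[\left((1 - (1 - 2\eta_0)\gamma_u)^2 + \frac{(1 - 2\eta_0)\sqrt{\varepsilon}}{4n}\right)^{-1}\right] / (4k + 4)$.

Step 1. Request $m$ noisy examples $\{(x^j \oplus x_N^j, f(x^j) y_N^j)\}_{j=1}^{m}$. For each $\alpha$ s.th. $|\alpha| \le k$, calculate $c_\alpha = \frac{1}{m}\sum_{j=1}^{m} f(x^j) y_N^j \chi_\alpha(x^j \oplus x_N^j)$.

Step 2. Initialize $t = 1$, then increase $t$ by $\Delta t$ until $t$ satisfies the following inequality or $t > 1/(1 - 2\eta_0)$. Regard such $t$ as an estimated value of $1/(1 - 2\eta)$.

$$\sum_{i=0}^{k} \Big(\sum_{|\alpha|=i} c_\alpha^2\Big) t^{2i+2} \ge (1 - (1 - 2\eta_0)\gamma_u)^2 + \frac{(1 - 2\eta_0)\sqrt{\varepsilon}}{4n}$$

Step 3. Output the hypothesis $h(x) = \mathrm{sign}\left(\sum_{|\alpha| \le k} c_\alpha t^{|\alpha|+1} \chi_\alpha(x)\right)$, and halt.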
For learning from a noisy sample, a learning algorithm is permitted to run in time quasipolynomial in $n$, $1/\varepsilon$, $1/\delta$ and $1/(1 - 2\eta)$. The running time of the robust k-lowdegree algorithm [7], [8] (rkl) is shown as follows.

Theorem 1. [Ohtsuki et al. [7], [8]] Let $f$ be a Boolean function and assume that $0 < \varepsilon, \delta < 1$, and an upper bound $\eta_0$ s.th. $0 < \eta \le \eta_0 < 1/2$ is given. Request the following $m$ examples, then the rkl algorithm runs in $O(mn^k)$ time and then outputs hypothesis $h$ within $\varepsilon$ errors with probability at least $1 - \delta$, where $k = O(\log^d(n)/((1 - 2\eta_0)\varepsilon))$.

$$m = 1568\, n^{k+2} \left(\frac{1}{1 - 2\eta_0}\right)^{2k+4} \frac{1}{\varepsilon} \ln\left(\frac{2n^k}{\delta}\right)$$



Sketch of the Proof. First, consider the influence of noise. 



Lemma 2 [Ohtsuki et al. [7], [8]]. Let $f$ be a Boolean function and assume that $0 < \eta < 1/2$. If $x_N \in \{0, 1\}^n$ and $y_N \in \{-1, +1\}$ are produced with probability $\prod_{x_N[i]=1} \eta \prod_{x_N[i]=0} (1 - \eta)$ and $\prod_{y_N=-1} \eta \prod_{y_N=1} (1 - \eta)$ respectively, then from noisy examples $\{(x^j \oplus x_N^j, f(x^j) y_N^j)\}_{j=1}^{m}$ we get:

$$E[f(x) y_N \chi_\alpha(x \oplus x_N)] = (1 - 2\eta)^{|\alpha|+1} \hat{f}(\alpha).$$

Proof of Lemma 2. Since $x$, $x_N$ and $y_N$ are independent of each other, we have the above Lemma 2 immediately. □
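The attenuation factor in Lemma 2 can be verified exactly for small $n$ by enumerating every input and every noise pattern. The sketch below is illustrative; the target function `f` and all names are hypothetical, and indices $\alpha$ are represented as bit tuples.

```python
from itertools import product

def chi(alpha, x):
    """chi_alpha(x) = (-1)^(alpha . x) for bit tuples alpha and x."""
    return -1 if sum(a & b for a, b in zip(alpha, x)) % 2 else 1

def fourier_coefficient(f, alpha, n):
    """Exact coefficient E[f(x) * chi_alpha(x)] under the uniform distribution."""
    return sum(f(x) * chi(alpha, x) for x in product((0, 1), repeat=n)) / 2 ** n

def noisy_coefficient(f, alpha, n, eta):
    """Exact value of E[f(x) * y_N * chi_alpha(x XOR x_N)] under the noise model."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        for x_noise in product((0, 1), repeat=n):
            flips = sum(x_noise)
            p_x = eta ** flips * (1.0 - eta) ** (n - flips)
            noisy_x = tuple(a ^ b for a, b in zip(x, x_noise))
            for y_noise, p_y in ((1, 1.0 - eta), (-1, eta)):
                total += p_x * p_y * f(x) * y_noise * chi(alpha, noisy_x)
    return total / 2 ** n

def f(x):
    """Hypothetical target on three bits: (x0 AND x1) OR x2, mapped to +-1."""
    return 1 if (x[0] and x[1]) or x[2] else -1

eta = 0.1
gaps = [
    abs(noisy_coefficient(f, alpha, 3, eta)
        - (1 - 2 * eta) ** (sum(alpha) + 1) * fourier_coefficient(f, alpha, 3))
    for alpha in product((0, 1), repeat=3)
]
```

Every entry of `gaps` should be zero up to rounding, which reflects the factorisation $\chi_\alpha(x \oplus x_N) = \chi_\alpha(x)\chi_\alpha(x_N)$ together with $E[y_N] = 1 - 2\eta$ and $E[\chi_\alpha(x_N)] = (1 - 2\eta)^{|\alpha|}$.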

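Thus, we have the true coefficients $\hat{f}(\alpha)$ by estimating $1/(1 - 2\eta)$. Let $c_\alpha$ be an estimated value of the biased Fourier coefficient $(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)$ and $t$ be an estimate of $1/(1 - 2\eta)$. Note that $t \ge 1$. Assuming $\sum_\alpha \hat{f}(\alpha)^2 \approx 0$ for $\alpha$ s.th. $|\alpha| > k$, we estimate $t$ that satisfies $\sum_{i=0}^{k} \left(\sum_{|\alpha|=i} c_\alpha^2\right) t^{2i+2} \approx 1$. Such $t$ should be a good approximator of $1/(1 - 2\eta)$ since $\sum_\alpha \hat{f}(\alpha)^2 \approx 1$ for $\alpha$ s.th. $|\alpha| \le k$.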

Lemma 3 [Ohtsuki et al. [7], [8]]. Let $f$ be a Boolean function. Assume that a known upper bound $\eta_0$ such that $0 < \eta \le \eta_0 < 1/2$ is given and $k = O(\log^d(n)/((1 - 2\eta_0)\varepsilon))$. If we have $t$ such that $|t - 1/(1 - 2\eta)| \le \gamma_u$ and $c_\alpha$ such that $|c_\alpha - (1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)| \le \gamma_c$, where $\gamma_u = \ln(1 + \sqrt{\varepsilon/2})/(k + 1)$ and $\gamma_c = ((1 - 2\eta_0)^{k+2}\sqrt{\varepsilon})/(28n\sqrt{n^k})$, then we obtain a hypothesis $h = \mathrm{sign}\left(\sum_{|\alpha| \le k} c_\alpha t^{|\alpha|+1} \chi_\alpha(x)\right)$ such that $error(h) \le \varepsilon$.

Proof of Lemma 3. Since the value $\eta_0$ ($\eta_0 \ge \eta$) is given, we determine the values of $k$ and $\gamma_c$. Thus we have $\sum_\alpha \hat{f}(\alpha)^2 \le ((1 - 2\eta_0)\varepsilon)/(28n)$ for any $\alpha$ s.th. $|\alpha| > k$. Consequently, by evaluating errors between such estimated values $c_\alpha$ and true values $\hat{f}(\alpha)$, and considering the fact that $\sum_\alpha \hat{f}(\alpha)^2 \le ((1 - 2\eta_0)\varepsilon)/(28n)$ for any $\alpha$ s.th. $|\alpha| > k$, we have the above Lemma 3.

□ 
Lemma 4 [Ohtsuki et al. [7], [8]]. Let $f$ be a Boolean function. Assume that an upper bound $\eta_0$ is given, $k = O(\log^d(n)/((1 - 2\eta_0)\varepsilon))$. Also assume that we have $c_\alpha$ such that $|c_\alpha - (1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)| \le \gamma_c$, where $\gamma_c = ((1 - 2\eta_0)^{k+2}\sqrt{\varepsilon})/(28n\sqrt{n^k})$. If $t$ satisfies the following inequality, then $|t - 1/(1 - 2\eta)| \le \gamma_u$, where $\gamma_u = \ln(1 + \sqrt{\varepsilon/2})/(k + 1)$.

$$\sum_{i=0}^{k} \Big(\sum_{|\alpha|=i} c_\alpha^2\Big) t^{2i+2} \ge (1 - (1 - 2\eta_0)\gamma_u)^2 + \frac{(1 - 2\eta_0)\sqrt{\varepsilon}}{4n}.$$

Proof of Lemma 4. Consider the contraposition of the above Lemma 4. After plugging such $t$ ($t < 1/(1 - 2\eta) - \gamma_u$) into $\sum_{i=0}^{k} \left(\sum_{|\alpha|=i} c_\alpha^2\right) t^{2i+2}$, we have the following inequality.

$$\sum_{i=0}^{k} \Big(\sum_{|\alpha|=i} c_\alpha^2\Big) t^{2i+2} < (1 - (1 - 2\eta)\gamma_u)^2 + \frac{(1 - 2\eta_0)\sqrt{\varepsilon}}{4n}$$

Hence we have the above Lemma 4. Note that we do not consider the case where $1/(1 - 2\eta) + \gamma_u < t$, since we start searching $t$ from $t = 1$ ($1/(1 - 2\eta) \ge 1$).

□ 



Consequently, the problem of estimating $1/(1 - 2\eta)$ can be regarded as the problem of searching for a real value $t$ that satisfies the condition in Lemma 4. In order to avoid estimating a $t$ that considerably deviates from the real $1/(1 - 2\eta)$, Ohtsuki et al. [7], [8] estimated such $t$ by increasing it by $\Delta t$ step by step, where

$$\Delta t = \frac{1}{4k + 4} \ln\left[\left((1 - (1 - 2\eta_0)\gamma_u)^2 + \frac{(1 - 2\eta_0)\sqrt{\varepsilon}}{4n}\right)^{-1}\right].$$



Lastly, from Chernoff bound [3], if we have the $m$ examples given in Theorem 1, then we have $|c_\alpha - (1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)| \le \gamma_c$, where $\gamma_c = ((1 - 2\eta_0)^{k+2}\sqrt{\varepsilon})/(28n\sqrt{n^k})$.

As a whole, Lemma 3 holds from the above sample size $m$ and Lemma 4, and thus we can complete the proof of Theorem 1.

Q.E.D. 



5 Estimating an Upper Bound $\eta_0$

The rkl algorithm needs to know an upper bound $\eta_0$ on $\eta$ in advance (which should be given as an input).

In this section, provided that $\eta < 1/2$, we present a method which estimates such an upper bound $\eta_0$.
5.1 How Do We Guess an Unknown $\eta$?

On the assumption that the noise rate $\eta$ is unknown to a learner, we begin with a simple observation. Can we not get any information about the noise from only a noisy sample? The point to observe is that, for $\alpha$ s.th. $|\alpha| \le k$, a noisy power spectrum $\sum_\alpha c_\alpha^2$ estimated from a noisy sample is an experimental value of $\sum_\alpha (1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2$, namely

$$\sum_\alpha c_\alpha^2 \approx \sum_\alpha (1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2 = (1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2.$$

This implies that $\sum_\alpha c_\alpha^2$ for $\alpha$ s.th. $|\alpha| \le k$ offers a key to estimating $\eta$. We can guess how those Fourier coefficients are reduced by the effect of the noise.

Let us now look into estimating an upper bound $\eta_0$ in detail. First, we analyze why the rkl algorithm needs such an upper bound $\eta_0$. Secondly, we introduce $\eta'$ ($0 \le \eta' < 1/2$) as an estimate of an upper bound $\eta_0$. Then we claim that if such $\eta'$ satisfies a certain condition in the frequency domain then the estimate $\eta'$ is larger than or equal to $\eta$. Thirdly, we present a procedure for estimating such an upper bound.

5.2 Why the rkl Algorithm Needs an Upper Bound $\eta_0$?

The rkl algorithm needs the following two conditions with respect to a known upper bound $\eta_0$.

i For a target Boolean function $f$, the rkl algorithm sets $k = O(\log^d(n)/((1 - 2\eta_0)\varepsilon))$ so that for $\alpha$ s.th. $|\alpha| > k$, $\sum_\alpha \hat{f}(\alpha)^2 \le ((1 - 2\eta_0)\varepsilon)/(28n)$.

ii The rkl algorithm requests $m$ examples to guarantee that $|c_\alpha - (1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)| \le \gamma_c$ ($= ((1 - 2\eta_0)^{k+2}\sqrt{\varepsilon})/(28n\sqrt{n^k})$) with high probability.

Namely, the rkl algorithm needs $\eta_0$ to decide proper values of $k$ and $m$.

Instead of using a known $\eta_0$, we shall determine the values of $k$ and $m$ by introducing a predicted value $\eta'$ ($0 \le \eta' < 1/2$) aiming at $\eta_0$.

Since we have no information how close $\eta$ is to 1/2, we will use an iterative search with $\eta' = 0, 1/4, 3/8, 7/16, \ldots$, and in each step we determine $k$ and $m$ step by step. As a result, we get a predicted $\eta'$ such that $\eta' \ge \eta$ and satisfy the above two conditions before Step 2.
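The search schedule $0, 1/4, 3/8, 7/16, \ldots$ corresponds to $\eta'_r = 1/2 - (1/2)^r$ in round $r$. The sketch below (hypothetical names, exact arithmetic via `Fraction`) lists the schedule and finds the first round whose prediction covers a given noise rate; it illustrates only the schedule, not the sampling or the halting test.

```python
from fractions import Fraction

def predicted_rate(r):
    """Predicted noise rate in round r: eta'_r = 1/2 - (1/2)^r."""
    return Fraction(1, 2) - Fraction(1, 2) ** r

def first_covering_round(eta):
    """Smallest round r with eta'_r >= eta (requires eta < 1/2, else no end)."""
    r = 1
    while predicted_rate(r) < eta:
        r += 1
    return r

schedule = [predicted_rate(r) for r in range(1, 5)]
round_for_two_fifths = first_covering_round(Fraction(2, 5))
```

Because $1 - 2\eta'_r = 2^{1-r}$ halves in every round, the first covering round is $1 + \lceil \log_2 (1 - 2\eta)^{-1} \rceil$, so the number of rounds grows only logarithmically in $1/(1 - 2\eta)$.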

Here, the main property we want to establish is the following Theorem 5.

Theorem 5. Let $\eta'$ ($0 \le \eta' < 1/2$) be a predicted value of $\eta$. We request the following $m(\eta', k)$ examples

$$m(\eta', k) = 1568\, n^{2k+2} \left(\frac{1}{1 - 2\eta'}\right)^{4k+4} \frac{1}{\varepsilon} \ln\left(\frac{2n^{2k}}{\delta}\right)$$

then by estimating the biased Fourier coefficients $c_\alpha$ for $\alpha$ s.th. $|\alpha| \le k$, we have that $\eta' \ge \eta$ if and only if $(1 - 2\eta')^{2|\alpha|+2} \le \sum_\alpha c_\alpha^2 - s(\eta', k)$, where $s(\eta', k) = ((1 - 2\eta')^{2k+2}\sqrt{\varepsilon})/(14n)$ is some error bound between an estimate $\sum_\alpha c_\alpha^2$ and an expected value $(1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2$.

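Proof. There are the following two claims to prove Theorem 5.

i In any frequency $|\alpha|$ for any $\alpha$ s.th. $|\alpha| \le k$, an error between an estimate $\sum_\alpha c_\alpha^2$ from a noisy sample and an expected value $\sum_\alpha (1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2$ is within $s(\eta', k)$. Namely, for any $\alpha$ s.th. $|\alpha| \le k$

$$\sum_\alpha c_\alpha^2 - s(\eta', k) \le (1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2.$$

ii For a predicted value $\eta'$, $\eta'$ is larger than or equal to $\eta$ if and only if

$$(1 - 2\eta')^{2|\alpha|+2} \le \sum \{c_\alpha^2 \mid \alpha \text{ s.th. } |\alpha| \le k\} - s(\eta', k).$$

Proof of i) Consider now the first claim. Let $\eta'$ be a predicted value of an upper bound $\eta_0$. If we request the above $m(\eta', k)$ examples, then we have an estimate $c_\alpha$ within $\gamma' = ((1 - 2\eta')^{2k+2}\sqrt{\varepsilon})/(28n^{k+1})$ error of its expected value $(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)$ with probability at least $1 - \delta$. For each $\alpha$ s.th. $|\alpha| \le k$, we can guarantee $|(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha) - c_\alpha| \le \gamma'$ from Chernoff bound [3]. Since $0 \le |(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha)|, |c_\alpha| \le 1$ for any $\alpha$ s.th. $|\alpha| \le k$, we get:

$$|(1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2 - c_\alpha^2| = |(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha) + c_\alpha|\, |(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha) - c_\alpha| \le 2\gamma'$$

$$\Rightarrow \sum_\alpha |(1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2 - c_\alpha^2| \le 2\gamma' n^k = s(\eta', k).$$

From these, for any $\alpha$ s.th. $|\alpha| \le k$, we get the estimate $\sum_\alpha c_\alpha^2$ within $s(\eta', k) = ((1 - 2\eta')^{2k+2}\sqrt{\varepsilon})/(14n)$ error bounds of its expectation $\sum_\alpha (1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2$.

Consequently, the above sample size $m(\eta', k)$ is large enough to guarantee for any $\alpha$ s.th. $|\alpha| \le k$:

$$\sum_\alpha |(1 - 2\eta)^{2|\alpha|+2}\hat{f}(\alpha)^2 - c_\alpha^2| \le s(\eta', k).$$

Hence we get the following inequality for any $\alpha$ s.th. $|\alpha| \le k$.

$$\sum_\alpha c_\alpha^2 - s(\eta', k) \le (1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2.$$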
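Proof of ii) Then consider now the second claim. We assume that a real value $p$ ($0 < p \le 1$) satisfies for any $\alpha$ s.th. $|\alpha| \le k$:

$$p \le \sum_\alpha c_\alpha^2 - s(\eta', k).$$

Since $\sum_\alpha c_\alpha^2$ for any $\alpha$ s.th. $|\alpha| \le k$ is an estimate of its expected value $(1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2$ and $\sum_\alpha \hat{f}(\alpha)^2$ is fixed for each $f$, for a real value $p$ ($0 < p \le 1$), we define $p = (1 - 2\eta')^{2|\alpha|+2}$ and get

$$(1 - 2\eta')^{2|\alpha|+2} \le \sum_\alpha c_\alpha^2 - s(\eta', k) \le (1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2$$

$$\Rightarrow (1 - 2\eta')^{2|\alpha|+2} \le (1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2$$

$$\Rightarrow \left(\frac{1 - 2\eta'}{1 - 2\eta}\right)^{2|\alpha|+2} \le \sum_\alpha \hat{f}(\alpha)^2 \le 1 \quad (\text{since } 0 \le \sum_\alpha \hat{f}(\alpha)^2 \le 1)$$

$$\Rightarrow \eta' \ge \eta.$$

We can now give a proof of Theorem 5 by using the above two claims.

Q.E.D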

In the following subsection, we shall present a procedure that halts satisfying 
conditions in the above Theorem 5. 



5.3 A Procedure for Estimating an Upper Bound 

To put it plainly, the procedure compares the predicted power spectrum from $\eta'$ with the observed power spectrum from a noisy sample, and then judges whether such $\eta'$ is larger than or equal to the noise rate $\eta$ or not.



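A Procedure for Estimating $\eta'$ ($\varepsilon$, $\delta$):
Define $m_r(\eta', k)$ and $s(\eta', k)$ as follows,

$$m_r(\eta', k) = 1568 \left(\frac{1}{1 - 2\eta'}\right)^{4k+4} \frac{n^{2k+2}}{\varepsilon} \ln\left(\frac{2^{r+1} n^{2k}}{\delta}\right), \qquad s(\eta', k) = \frac{(1 - 2\eta')^{2k+2}\sqrt{\varepsilon}}{14n}$$

Input: $\delta$ ($0 < \delta < 1$), $\varepsilon$ ($0 < \varepsilon < 1$)

Output: $\eta'$ such that $\eta' \ge \eta$

Procedure

1 $r := 1$, $\eta' := 0$, $k = O(\log^d(n)/((1 - 2\eta')\varepsilon))$

2 (round $r$) Repeat until the halt condition is satisfied.

2.1 Request $m_r(\eta', k)$ examples.

2.2 For each $\alpha$ s.th. $|\alpha| \le k$, obtain

$$c_\alpha = \frac{1}{m} \sum_{j=1}^{m} f(x^j) y_N^j \chi_\alpha(x^j \oplus x_N^j).$$

2.3 $p := (1 - 2\eta')$, $s := s(\eta', k)$

2.4 If the following inequality is satisfied in some frequency $i$ ($0 \le i \le k$), then halt and output $\eta'$.

$$p^{2i+2} \le \sum_{\alpha \text{ s.th. } |\alpha| = i} c_\alpha^2 - s,$$

2.5 Else, execute the following and go to the next round,

$r := r + 1$, $\eta' := \frac{1}{2} - \left(\frac{1}{2}\right)^r$, $k = O(\log^d(n)/((1 - 2\eta')\varepsilon))$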
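The halting test of step 2.4 is a per-frequency comparison between the predicted and the observed power spectrum. A minimal sketch follows, assuming the condition $(1 - 2\eta')^{2i+2} \le \sum_{|\alpha| = i} c_\alpha^2 - s$ that the proof of Theorem 6 attributes to a halt; the function name `halts`, the dictionary layout (coefficients keyed by bit tuples $\alpha$) and the numbers in the example are illustrative.

```python
def halts(coeffs, eta_pred, s, k):
    """True if some frequency i <= k satisfies p^(2i+2) <= sum_{|alpha|=i} c^2 - s."""
    p = 1.0 - 2.0 * eta_pred
    for i in range(k + 1):
        spectrum_i = sum(c * c for alpha, c in coeffs.items() if sum(alpha) == i)
        if p ** (2 * i + 2) <= spectrum_i - s:
            return True
    return False

# Hypothetical target f(x) = chi_(1,0)(x) on two bits with true noise rate 0.1:
# the only nonzero biased coefficient is (1 - 2*0.1)^(1+1) = 0.64.
coeffs = {(0, 0): 0.0, (1, 0): 0.64, (0, 1): 0.0, (1, 1): 0.0}
too_small = halts(coeffs, eta_pred=0.05, s=0.01, k=1)   # prediction below eta
large_enough = halts(coeffs, eta_pred=0.25, s=0.01, k=1)
```

In this toy case the test fails for the prediction 0.05, which is below the true rate, and succeeds for 0.25, in line with the claim that a halt implies $\eta' \ge \eta$.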

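Theorem 6. Define $m_r(\eta', k)$ as follows.

$$m_r(\eta', k) = 1568 \left(\frac{1}{1 - 2\eta'}\right)^{4k+4} \frac{n^{2k+2}}{\varepsilon} \ln\left(\frac{2^{r+1} n^{2k}}{\delta}\right)$$

Then with probability at least $1 - \delta$, the above procedure halts in or before round $r_0 = 2 + \lceil \log_2 (1 - 2\eta)^{-1} \rceil$ and outputs $\eta'$ such that $\eta' \ge \eta$.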

Proof. To begin with, we will examine the probability that for any $\alpha$ in any round $r$,

$$|(1 - 2\eta)^{|\alpha|+1}\hat{f}(\alpha) - c_\alpha| > \gamma'. \qquad (1)$$

After $r$ rounds of independent trials, there are $2^r$ possible rounds. So by dividing the probability that formula (1) occurs, we can bound the probability within $\delta/2$ for any round. Namely, the sample size $m_r(\eta', k)$ is defined to guarantee for any $\alpha$ s.th. $|\alpha| \le k$ in any round $r$ that



Pr[^|(l- 277)l“l+i/(a)-c,,|>7']< 



5/2 

2’' X 



y. 2 ^ 



5 

2 ■ 



We now turn to the procedure for estimating an upper bound. We show that the procedure satisfies the following with probability at least $1 - \delta$.

First, we show that if the procedure halts satisfying the halt condition in Step 2, then its output $\eta'$ is larger than or equal to $\eta$.

Assume that the procedure halts in round $r_0$. By using $m_{r_0}(\eta', k)$ examples, the procedure guarantees $\sum_\alpha c_\alpha^2 \le (1 - 2\eta)^{2|\alpha|+2} + s(\eta', k)$ with probability at least $1 - \delta/2$. Also the halt of the procedure implies $(1 - 2\eta')^{2|\alpha|+2} + s(\eta', k) \le \sum_\alpha c_\alpha^2$ for any $\alpha$ s.th. $|\alpha| \le k$. Thus, from these inequalities together, we get the following inequality and hence $\eta' \ge \eta$.

$$(1 - 2\eta')^{2|\alpha|+2} + s(\eta', k) \le (1 - 2\eta)^{2|\alpha|+2} + s(\eta', k).$$
Secondly, we show that the procedure halts in or before round $r_0 = 2 + \lceil \log_2 (1 - 2\eta)^{-1} \rceil$. Now assume that $\eta'_r = \frac{1}{2} - \left(\frac{1}{2}\right)^r$ in round $r$ and satisfy

$$(1 - 2\eta'_r)^{2|\alpha|+2} \cdot 1 = (1 - 2\eta)^{2|\alpha|+2} \cdot 1. \qquad (2)$$

Namely, $\eta'_r = \eta$ in round $r$.

In round $r + 1$, since $\eta'_{r+1} = \frac{1}{2} - \left(\frac{1}{2}\right)^{r+1}$, we get $(1 - 2\eta'_{r+1})^{2|\alpha|+2} = \left(\frac{1}{2}\right)^{2|\alpha|+2} (1 - 2\eta'_r)^{2|\alpha|+2}$. Compared with the equality (2), the left-hand side is reduced by $\left(\frac{1}{2}\right)^{2|\alpha|+2}$ times. Therefore, in round $r + 1$, we show that the following inequality (3) holds in some frequency $|\alpha| = i$ for any $\alpha$ s.th. $|\alpha| \le k$.

$$(1 - 2\eta'_r)^{2|\alpha|+2} \left(\frac{1}{2}\right)^{2|\alpha|+2} \le (1 - 2\eta)^{2|\alpha|+2} \sum_\alpha \hat{f}(\alpha)^2. \qquad (3)$$

At that time, the procedure satisfies this and then halts. Since we assume that $\eta'_r = \eta$, the above inequality (3) can be regarded as the following inequality

$$\left(\frac{1}{2}\right)^{2|\alpha|+2} \le \sum \{\hat{f}(\alpha)^2 \mid \alpha \text{ s.th. } |\alpha| = i\}.$$

To show that the inequality (3) holds in at least one frequency $|\alpha|$ for $\alpha$ s.th. $|\alpha| \le k$, by summing both sides over $0 \le |\alpha| \le k$ we get

$$\sum_{|\alpha|=0}^{k} \left(\frac{1}{2}\right)^{2|\alpha|+2} < \frac{1}{3} \quad \text{and} \quad \sum_\alpha \hat{f}(\alpha)^2 \ge 1 - \frac{\varepsilon}{4} > \frac{3}{4} \qquad (4)$$

since for any $\alpha$ s.th. $|\alpha| > k$, $\sum_\alpha \hat{f}(\alpha)^2 \le \varepsilon/4$.

Thus we have

$$\sum_{|\alpha|=0}^{k} \left(\frac{1}{2}\right)^{2|\alpha|+2} < \sum_{\alpha \text{ s.th. } |\alpha| \le k} \hat{f}(\alpha)^2. \qquad (5)$$

As a result, by using formulas (2) and (4) we can show that the inequality (3) holds. This implies that the procedure halts in or before round $r + 1$. Since $\eta'$ is updated $1 + \lceil \log_2 (1 - 2\eta)^{-1} \rceil$ times by round $r$, the procedure halts in or before $r_0 = r + 1 = 2 + \lceil \log_2 (1 - 2\eta)^{-1} \rceil$.

Finally, since each of the above two points fails with probability at most $\delta/2$, the procedure fails with probability at most $\delta$. Hence Theorem 6 holds. Q.E.D




By combining the above procedure with the vk\ algorithm, we get a learning
algorithm that needs no information about the noise rate $\eta$ in advance. The
learning algorithm outputs a hypothesis $h$ within $\epsilon$ error with
probability at least $1 - \delta$. Also, since the procedure requests O(rm) examples,
its time complexity is

0{rmn^) = O






6 Conclusion 

In this paper we have presented a procedure that estimates an upper bound $\eta_0$
on the frequency domain. By combining it with Ohtsuki et al.'s algorithm, we get
a more powerful learning algorithm that works without knowing $\eta$ in advance.
Appendix



Supplement to the Proof of Theorem 1 
Proof of Lemma 3 ([7], [8]). 

error ^sign j j = Pr [/(a;) 7^ K^)] 

ct s.th. |o:|<fc ct s.th. |a|>fc 

< (e7.(fc+i) /(a)" + 27c(e^“('=+'^ - 1) E E 
wj^] e-(-+^)n'=+ Kaf

^ a s.th. |ct|>fc 

Thus we get error (^sign s.th. |a|<fc Cai'“'“''^Xa(a;)) ) < e for any a s.th. 

|a| < k, since 7 „ = ln(l + y/el2)l{k + 1), 7c = ((1 - 2r])'^+'^ y/e) / (28nV^) 
and J2a — ((1 ~ 2ryo)e)/(28n) for any a s.th. |a| > k. □ 

Proof of Lemma 4 ([7], [8]). Consider the contraposition. If we assume that 
- 7« > C then 
Abstract. We study impurity-based decision tree algorithms such as
CART, C4.5, etc., so as to better understand their theoretical underpinnings.
We consider such algorithms on special forms of functions and
distributions. We deal with the uniform distribution and functions that
can be described as boolean linear threshold functions or read-once
DNF.

We show that for boolean linear threshold functions and read-once DNF, 
maximal purity gain and maximal influence are logically equivalent. 
This leads us to the exact identification of these classes of functions by 
impurity-based algorithms given sufficiently many noise-free examples. 
We show that the decision tree resulting from these algorithms has minimal
size and height amongst all decision trees representing the function.
Based on the statistical query learning model, we introduce a noise-tolerant
version of practical decision tree algorithms. We show that if
the input examples have small classification noise and are uniformly
distributed, then all our results for practical noise-free impurity-based
algorithms also hold for their noise-tolerant version.



1 Introduction 

Introduced in 1983 by Breiman et al. [4], decision trees are one of the few knowl- 
edge representation schemes which are easily interpreted and may be inferred 
by very simple learning algorithms. The practical usage of decision trees is enor- 
mous (see [24] for a detailed survey). The most popular practical decision tree 
algorithms are CART [4], C4.5 [25] and their various modifications. The heart of 
these algorithms is the choice of splitting variables according to maximal purity 
gain value. To compute this value these algorithms use various impurity func- 
tions. For example, CART employs the Gini index impurity function and C4.5 
uses an impurity function based on entropy. We refer to this family of algorithms 
as “impurity-based”. 

Despite their practical success, the most commonly used algorithms and systems for
building decision trees lack a strong theoretical basis. It would be interesting to
obtain bounds on the generalization error and on the size of the decision trees
resulting from these algorithms given some predefined number of examples.

There have been several theoretical results justifying practical decision tree
building algorithms. Kearns and Mansour showed in [19] that if the function
used for labelling the nodes of the tree is a weak approximator of the target
function, then the impurity-based algorithms for building decision trees using the Gini
index, entropy or the new index are boosting algorithms. This property ensures
distribution-free PAC learning and arbitrarily small generalization error given
sufficiently many input examples. This work was recently extended by Takimoto and
Maruoka [26] to functions having more than two values and by Kalai and Servedio
[17] to noisy examples.

We restrict ourselves to uniformly distributed input examples. We
provide new insight into practical impurity-based decision tree algorithms by
showing that for unate boolean functions, the choice of splitting variable according
to maximal exact purity gain is equivalent to the choice of variable according
to maximal influence. Then we introduce the algorithm DTExactPG, which
is a modification of impurity-based algorithms that uses exact probabilities and
purity gain rather than estimates. Let $f(x)$ be a read-once DNF or a boolean
linear threshold function (LTF) and let $h$ be the minimal depth of a decision tree
representing $f(x)$. The main results of our work are:

Theorem 1. The algorithm DTExactPG builds a decision tree representing
$f(x)$ and having minimal size amongst all decision trees representing $f(x)$. The
resulting tree also has minimal height amongst all decision trees representing
$f(x)$.

Theorem 2. For any $\delta > 0$, given $O(2^{9h} \ln^2 \frac{1}{\delta}) = \mathrm{poly}(2^h, \ln \frac{1}{\delta})$ uniformly
distributed noise-free random examples of $f(x)$, with probability at least $1 - \delta$, CART
and C4.5 build a decision tree computing $f(x)$ exactly. The resulting tree has
minimal size and minimal height amongst all decision trees representing $f(x)$.

Theorem 3. For any $\delta > 0$, given $O(2^{9h} \ln^2 \frac{1}{\delta}) = \mathrm{poly}(2^h, \ln \frac{1}{\delta})$ uniformly
distributed random examples of $f(x)$ corrupted by classification noise with constant
rate $\eta < 0.5$, with probability at least $1 - \delta$, a noise-tolerant version of impurity-based
algorithms builds a decision tree representing $f(x)$. The resulting tree has
minimal size and minimal height amongst all decision trees representing $f(x)$.

Figure 1 summarizes the bounds on the size and height of decision trees, obtained 
in our work. 

We stress that our primary concern is the theoretical justification of practical
algorithms. Thus our results on the learnability of read-once DNF and boolean
LTF are not necessarily optimal. This non-optimality stems mainly from the inherent
redundancy of the decision tree representation of boolean functions.

1.1 Previous Work 

Building a decision tree of minimal height or with a minimal
number of nodes, consistent with all examples given, is NP-hard [15].
Function      | Exact Influence      | Exact Purity Gain    | CART, C4.5, etc., poly($2^h$) uniform noise-free examples | Modification of CART, C4.5, etc., poly($2^h$) uniform examples with small classification noise
Boolean LTF   | min size, min height | min size, min height | min size, min height                                      | min size, min height
Read-once DNF | min size, min height | min size, min height | min size, min height                                      | min size, min height

Fig. 1. Summary of bounds on decision trees, obtained in our work.



The only polynomial-time deterministic approximation algorithm known today
for approximating the height of decision trees is the simple greedy algorithm
[23], achieving the factor $O(\ln(m))$ ($m$ is the number of input examples).
Combining the results of [13] and [10], it can be shown that the depth
of a decision tree cannot be approximated within a factor $(1 - \epsilon) \ln(m)$ unless
$NP \subset DTIME[n^{O(\log \log n)}]$. Hancock et al. showed in [12] that the problem
of building a decision tree with a minimal number of nodes cannot be approximated
within a factor $2^{\log^{\delta} n}$ for any $\delta < 1$, unless $NP \subset RTIME[2^{\mathrm{poly} \log n}]$.
Blum et al. showed in [3] that decision trees cannot even be weakly learned in
polynomial time from statistical queries dealing with uniformly distributed examples.
This result is evidence for the difficulty of PAC learning of decision
trees of arbitrary functions in the noise-free and noisy settings.

Figure 2 summarizes the best results obtained by known theoretical algo- 
rithms for learning decision trees from noise-free examples. Many of them may 
be modified, to obtain corresponding noise-tolerant versions. 

Kearns and Valiant [20] proved that distribution-free weak learning of read-once
DNF using any representation is equivalent to several cryptographic problems
widely believed to be hard. Mansour and Schain give in [22] an algorithm
for proper PAC-learning of read-once DNF in polynomial time from random examples
taken from any maximum entropy distribution. This algorithm may be
easily modified to obtain polynomial-time probably correct learning in case the
underlying function has a decision tree of logarithmic depth and the input examples
are uniformly distributed, matching the performance of our algorithm in
this case. Using both membership and equivalence queries, Angluin et al. showed
in [1] a polynomial-time algorithm for exact identification of read-once DNF by
read-once DNF using examples taken from any distribution.

Boolean linear threshold functions are polynomially properly PAC learnable 
from both noise-free examples (folk result) and examples with small classification 
noise [9]. In both cases the examples may be taken from any distribution. 

1.2 Structure of the Paper 

In Section 2 we give relevant definitions. In Section 3 we introduce a new algorithm
for building decision trees using an oracle for influence and prove several
properties of the resulting decision trees. In Section 4 we prove Theorem 1. In
Section 5 we prove Theorem 2. In Section 6 we introduce the noise-tolerant version
of impurity-based algorithms and prove Theorem 3. In Section 7 we outline
directions for further research.

Fig. 2. Summary of decision tree noise-free learning algorithms.



2 Background 

In this paper we use standard definitions of PAC [27] and statistical query [18] 
learning models. All our results are in the PAC model with zero generalization 
error. We denote this model by PC (Probably Correct). 

A boolean function (concept) is defined as $f : \{0,1\}^n \to \{0,1\}$ (for boolean
formulas, e.g. read-once DNF) or as $f : \{-1,1\}^n \to \{0,1\}$ (for arithmetic
formulas, e.g. boolean linear threshold functions). Let $x_i$ be the $i$-th variable or
attribute. Let $x = (x_1, \ldots, x_n)$, and let $f(x)$ be the target or classification of $x$.
The vector $(x_1, x_2, \ldots, x_n, f(x))$ is called an example. Let $f_{x_i = a}(x)$, $a \in \{0,1\}$,
be the function $f(x)$ restricted to $x_i = a$. We refer to the assignment $x_i = a$
as a restriction. Given the set of restrictions $R = \{x_{i_1} = a_1, \ldots, x_{i_k} = a_k\}$, the
restricted function $f_R(x)$ is defined similarly. We write $x_i \in R$ iff there exists a restriction
$x_i = a \in R$, where $a$ is any value.

A literal $\tilde{x}_i$ is a boolean variable $x_i$ itself or its negation $\bar{x}_i$. A term is a
conjunction of literals and a DNF (Disjunctive Normal Form) formula is a disjunction
of terms. Let $|F|$ be the number of terms in the DNF formula $F$ and
$|t_i|$ be the number of literals in the term $t_i$. Essentially $F$ is a set of terms
$F = \{t_1, \ldots, t_{|F|}\}$ and $t_i$ is a set of literals, $t_i = \{\tilde{x}_{i_1}, \ldots, \tilde{x}_{i_{|t_i|}}\}$. The term $t_i$ is
satisfied iff $\tilde{x}_{i_1} = \ldots = \tilde{x}_{i_{|t_i|}} = 1$.

Fig. 3. Example of the decision tree representing $f(x) = x_1 x_3 \vee x_1 x_2$

If for all $1 \le i \le n$, $f(x)$ is monotone w.r.t. $x_i$ or $\bar{x}_i$ then $f(x)$ is a unate function.
A DNF is read-once if each variable appears at most once. Given a weight
vector $a = (a_1, \ldots, a_n)$, such that for all $1 \le i \le n$, $a_i \in \mathbb{R}$, and a threshold
$t \in \mathbb{R}$, the boolean linear threshold function $f_{a,t}$ is $f_{a,t}(x) = \left(\sum_{i=1}^{n} a_i x_i > t\right)$. Both
read-once DNF and boolean linear threshold functions are unate functions.

Let $e_i$ be the vector of $n$ components, containing 1 in the $i$-th component
and 0 in all other components. The influence of $x_i$ on $f(x)$ under distribution
$D$ is $I_f(i) = \Pr_{x \sim D}[f(x) \ne f(x \oplus e_i)]$. We use the notion of an influence oracle as
an auxiliary tool. The influence oracle runs in time $O(1)$ and returns the exact
value of $I_f(i)$ for any $f$ and $i$.
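For small $n$ the influence oracle can be emulated by exhaustive enumeration. The Python sketch below does so under the uniform distribution. The function name `influence`, the 0-based variable indexing and the example formula are illustrative assumptions, not part of the paper, and the enumeration takes time $2^n$ rather than the $O(1)$ assumed for the oracle.

```python
from itertools import product


def influence(f, n, i):
    """Exact influence of variable i (0-based) on f under the uniform
    distribution over {0,1}^n: the fraction of inputs x for which
    f(x) differs from f(x XOR e_i)."""
    changed = 0
    for x in product((0, 1), repeat=n):
        y = list(x)
        y[i] ^= 1  # flip the i-th bit, i.e. x XOR e_i
        if f(x) != f(tuple(y)):
            changed += 1
    return changed / 2 ** n


# Hypothetical read-once DNF used only for illustration: x0 x1 OR x2.
def example_f(x):
    return int((x[0] and x[1]) or x[2])


influences = [influence(example_f, 3, i) for i in range(3)]
```

In this example the variable in the shorter term has the larger influence, in line with the relation between term length and influence that is stated later for read-once DNF.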

In our work we restrict ourselves to binary univariate decision trees for
boolean functions. We use the standard definitions of inner nodes, leaves and
splitting variables (see [25]). The left (right) son of inner node $s$ is also called
the 0-son (1-son) and is referred to as $s_0$ ($s_1$). Let $c(l)$ be the label of the leaf $l$.
Upon arriving at the node $s$, we pass the input $x$ to the ($x_i = 1$?)-son of $s$. The
classification given to the input $x$ by the tree $T$ is denoted by $c_T(x)$. The path from
the root to the node $s$ corresponds to the set of restrictions of values of variables
leading to $s$. Thus the node $s$ corresponds to the restricted function $f_R(x)$. In
the sequel we use the identifier $s$ of the node and its corresponding restricted
function interchangeably.

The height of $T$, $h(T)$, is the maximal length of a path from the root to any
node. The size of $T$, $|T|$, is the number of nodes in $T$. A decision tree $T$ represents
$f(x)$ iff $f(x) = c_T(x)$ for all $x$. An example of a decision tree is shown in Fig. 3.
The function $\phi(x) : [0,1] \to \mathbb{R}$ is an impurity function if it is concave,
$\phi(x) = \phi(1 - x)$ for any $x \in [0,1]$ and $\phi(0) = \phi(1) = 0$. Examples of impurity functions
are the Gini index $\phi(x) = 4x(1 - x)$ ([4]), the entropy function
$\phi(x) = -x \log x - (1 - x) \log(1 - x)$ ([25]) and the new index $\phi(x) = 2\sqrt{x(1 - x)}$ ([19]).
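The three impurity functions can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the logarithm base for the entropy is assumed to be 2, which the text does not specify. The helper `looks_like_impurity` spot-checks the three defining properties (zero at the endpoints, symmetry about 0.5, midpoint concavity) on a grid; it is a sanity check, not a proof.

```python
import math


def gini(x):
    """Gini index, scaled so that gini(0.5) == 1."""
    return 4 * x * (1 - x)


def entropy(x):
    """Binary entropy; base 2 is an assumption made for this sketch."""
    if x in (0, 1):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def new_index(x):
    """The 'new index' impurity function 2 * sqrt(x (1 - x))."""
    return 2 * math.sqrt(x * (1 - x))


def looks_like_impurity(phi, steps=50, tol=1e-9):
    """Grid check of phi(0) = phi(1) = 0, symmetry and midpoint concavity."""
    grid = [k / steps for k in range(steps + 1)]
    ends_ok = abs(phi(0)) < tol and abs(phi(1)) < tol
    symmetric = all(abs(phi(x) - phi(1 - x)) < tol for x in grid)
    concave = all(phi((a + b) / 2) >= (phi(a) + phi(b)) / 2 - tol
                  for a in grid for b in grid)
    return ends_ok and symmetric and concave
```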

Let $s_a(i)$, $a \in \{0,1\}$, denote the $a$-son of $s$ that would be created if $x_i$ is placed
at $s$ as a splitting variable. For each node $s$ let $\Pr[s_a(i)]$, $a \in \{0,1\}$, denote the
probability that a random example from the uniform distribution arrives at $s_a(i)$
given that it has already arrived at $s$. Let $p(s)$ be the probability that an example
arriving at $s$ is positive. The impurity sum (IS) of $x_i$ at $s$ using impurity function
$\phi(x)$ is $IS(s, x_i, \phi) = \Pr[s_0(i)]\phi(p(s_0(i))) + \Pr[s_1(i)]\phi(p(s_1(i)))$. The purity gain
DTApproxPG(s, X, R, $\phi$)

1: if all examples arriving at s have the same classification then
2:   Set s as a leaf with that value.
3: else
4:   Choose $x_i = \arg\max_{x_i \in X} \{\widehat{PG}(f_R, x_i, \phi)\}$ to be a splitting variable.
5:   Run DTApproxPG($s_1$, $X - \{x_i\}$, $R \cup \{x_i = 1\}$, $\phi$).
6:   Run DTApproxPG($s_0$, $X - \{x_i\}$, $R \cup \{x_i = 0\}$, $\phi$).
7: end if

Fig. 4. DTApproxPG algorithm - generic structure of all impurity-based algorithms.

DTInfluence(s, X, R)

1: if $\forall x_i \in X$, $I_{f_R}(i) = 0$ then
2:   Set the classification of s to the classification of any example arriving at it.
3: else
4:   Choose $x_i = \arg\max_{x_i \in X} \{I_{f_R}(i)\}$ to be a splitting variable.
5:   Run DTInfluence($s_1$, $X - \{x_i\}$, $R \cup \{x_i = 1\}$).
6:   Run DTInfluence($s_0$, $X - \{x_i\}$, $R \cup \{x_i = 0\}$).
7: end if

Fig. 5. DTInfluence algorithm.

(PG) of $x_i$ at $s$ is: $PG(s, x_i, \phi) = \phi(p(s)) - IS(s, x_i, \phi)$. The estimated values of
all these quantities are $\widehat{PG}$, $\widehat{IS}$, etc. We say that the quantity $A$ is estimated
within accuracy $\alpha \ge 0$ if $A - \alpha < \hat{A} < A + \alpha$.

Since the value of $\phi(p(s))$ is attribute-independent, the choice of maximal
$PG(s, x_i, \phi)$ is equivalent to the choice of minimal $IS(s, x_i, \phi)$. For uniformly
distributed examples $\Pr[s_0(i)] = \Pr[s_1(i)] = 0.5$. Thus if the impurity sum is computed
exactly, then $\phi(p(s_0(i)))$ and $\phi(p(s_1(i)))$ have equal weight.
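The definitions of impurity sum and purity gain can be checked numerically on a small function. The following Python sketch computes both exactly under the uniform distribution by enumerating the inputs consistent with a set of restrictions. All names (`positive_rate`, `impurity_sum`, `purity_gain`, the dictionary encoding of restrictions, the example formula) are assumptions made for illustration.

```python
from itertools import product


def gini(p):
    return 4 * p * (1 - p)


def positive_rate(f, n, fixed):
    """p(s): fraction of positive examples among the uniform inputs that
    satisfy the restrictions `fixed` (a dict: variable index -> value)."""
    points = [x for x in product((0, 1), repeat=n)
              if all(x[k] == v for k, v in fixed.items())]
    return sum(f(x) for x in points) / len(points)


def impurity_sum(f, n, fixed, i, phi):
    """IS(s, x_i, phi); under the uniform distribution each son of s
    receives an example with probability 0.5."""
    return sum(0.5 * phi(positive_rate(f, n, {**fixed, i: a}))
               for a in (0, 1))


def purity_gain(f, n, fixed, i, phi):
    """PG(s, x_i, phi) = phi(p(s)) - IS(s, x_i, phi)."""
    return phi(positive_rate(f, n, fixed)) - impurity_sum(f, n, fixed, i, phi)


# Hypothetical example: f = x0 x1 OR x2, evaluated at the root (no restrictions).
def example_f(x):
    return int((x[0] and x[1]) or x[2])


gains = [purity_gain(example_f, 3, {}, i, gini) for i in range(3)]
sums = [impurity_sum(example_f, 3, {}, i, gini) for i in range(3)]
```

Here the variable with the largest purity gain is also the one with the smallest impurity sum, as the paragraph above states.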

Figure 4 gives the structure of all impurity-based algorithms. The algorithm
takes four parameters: $s$, identifying the current tree node; $X$, standing for the
set of attributes available for testing; $R$, which is the set of the function's restrictions
leading to $s$; and $\phi$, identifying the impurity function. Initially $s$ is set to the root
node, $X$ contains all attribute variables and $R$ is an empty set.

3 Building Decision Trees Using an Influence Oracle 

In this section we introduce a new algorithm, named DTInfluence (see Fig. 5),
for building decision trees using an influence oracle. This algorithm greedily
chooses the splitting variable with maximal influence. Clearly, the resulting tree
consists of only relevant variables. The parameters of the algorithm have the
same meaning as those of DTApproxPG.
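A minimal Python sketch of this greedy procedure follows, with the influence oracle replaced by brute-force enumeration over the restricted subcube. The tuple encoding of trees (`('leaf', value)` and `('node', i, zero_son, one_son)`), the helper names and the tie-breaking rule (lowest index among variables of maximal influence) are assumptions of this sketch, not choices made in the paper.

```python
from itertools import product


def restricted_points(n, fixed):
    """All inputs in {0,1}^n consistent with the restrictions `fixed`."""
    return [x for x in product((0, 1), repeat=n)
            if all(x[k] == v for k, v in fixed.items())]


def influence(f, n, fixed, i):
    """Influence of the free variable i on the restricted function f_R."""
    points = restricted_points(n, fixed)
    flips = 0
    for x in points:
        y = list(x)
        y[i] ^= 1
        flips += f(x) != f(tuple(y))
    return flips / len(points)


def dt_influence(f, n, free=None, fixed=None):
    """Greedy tree construction: split on a variable of maximal influence."""
    free = set(range(n)) if free is None else free
    fixed = {} if fixed is None else fixed
    infl = {i: influence(f, n, fixed, i) for i in free}
    if all(v == 0 for v in infl.values()):
        # No free variable matters, so f_R is constant: label with any example.
        return ('leaf', f(restricted_points(n, fixed)[0]))
    i = max(sorted(free), key=lambda k: infl[k])
    return ('node', i,
            dt_influence(f, n, free - {i}, {**fixed, i: 0}),
            dt_influence(f, n, free - {i}, {**fixed, i: 1}))


def classify(tree, x):
    while tree[0] == 'node':
        _, i, zero_son, one_son = tree
        tree = one_son if x[i] else zero_son
    return tree[1]


def size(tree):
    return 1 if tree[0] == 'leaf' else 1 + size(tree[2]) + size(tree[3])


# Hypothetical example: f = x0 x1 OR x2.
def example_f(x):
    return int((x[0] and x[1]) or x[2])


example_tree = dt_influence(example_f, 3)
```

On this example the root tests `x2`, the variable of the shortest term, and the resulting tree has three inner nodes and four leaves.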

Lemma 1. Let $f(x)$ be any boolean function. Then the decision tree $T$ built by
the algorithm DTInfluence represents $f(x)$ and has no inner node such that
all examples arriving at it have the same classification.

Proof. See online full version [11]. □ 
DTMinTerm(s, F)

1: if $\exists\, t_i \in F$ such that $t_i = \emptyset$ then
2:   Set s as a positive leaf.
3: else
4:   if $F = \emptyset$ then
5:     Set s as a negative leaf.
6:   else
7:     Let $t_{min} = \arg\min_{t_i \in F} \{|t_i|\}$, $t_{min} = \{\tilde{x}_{m_1}, \tilde{x}_{m_2}, \ldots, \tilde{x}_{m_{|t_{min}|}}\}$.
8:     Choose any $\tilde{x}_{m_i} \in t_{min}$. Let $t'_{min} = t_{min} \setminus \{\tilde{x}_{m_i}\}$.
9:     if $\tilde{x}_{m_i} = x_{m_i}$ then
10:      Run DTMinTerm($s_1$, $F \setminus \{t_{min}\} \cup \{t'_{min}\}$), DTMinTerm($s_0$, $F \setminus \{t_{min}\}$).
11:    else
12:      Run DTMinTerm($s_0$, $F \setminus \{t_{min}\} \cup \{t'_{min}\}$), DTMinTerm($s_1$, $F \setminus \{t_{min}\}$).
13:    end if
14:  end if
15: end if

Fig. 6. DTMinTerm algorithm.

3.1 Read-Once DNF 

Lemma 2. For any $f(x)$ which can be represented as a read-once DNF, the decision
tree built by the algorithm DTInfluence has minimal size and minimal
height amongst all decision trees representing $f(x)$.

The proof of Lemma 2 consists of two parts. In the first part of the proof we 
introduce the algorithm DTMinTerm (see Fig. 6) and prove Lemma 2 for it. 
In the second part of the proof we show that the trees built by DTMinTerm 
and DTInfluence are the same. 

Assume we are given a read-once DNF formula $F$. We change the algorithm
DTInfluence so that the splitting rule is to choose any variable $x_i$ in the smallest
term $t_j \in F$. The algorithm stops when the restricted function becomes constant
(true or false). The new algorithm, denoted by DTMinTerm, is shown
in Fig. 6. The initial value of the first parameter of the algorithm is the same
as in DTInfluence, and the second parameter is initially set to the function's DNF
formula $F$. The next two lemmata are proved in the full online version of the
paper [11].
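The splitting rule just described can be sketched in Python as follows. The encoding is an assumption of this sketch: a DNF is a list of frozensets of literals, a literal is a pair `(variable, sign)` with `sign == 1` for $x$ and `sign == 0` for its negation, and "choose any" literal is resolved by taking the smallest pair. The formula is assumed to be read-once, so fixing a variable affects only the chosen term.

```python
from itertools import product


def dt_min_term(terms):
    """Build a tree for a read-once DNF given as a list of frozensets of
    literals (variable, sign). Trees are ('leaf', v) or
    ('node', variable, zero_son, one_son)."""
    if any(len(t) == 0 for t in terms):
        return ('leaf', 1)          # an emptied term is satisfied
    if not terms:
        return ('leaf', 0)          # no term can be satisfied any more
    t_min = min(terms, key=len)     # a smallest term
    var, sign = min(t_min)          # "choose any" literal: take the smallest
    rest = [t for t in terms if t is not t_min]
    satisfied = dt_min_term(rest + [t_min - {(var, sign)}])
    falsified = dt_min_term(rest)   # the whole term is dropped
    if sign == 1:
        return ('node', var, falsified, satisfied)
    return ('node', var, satisfied, falsified)


def classify(tree, x):
    while tree[0] == 'node':
        _, i, zero_son, one_son = tree
        tree = one_son if x[i] else zero_son
    return tree[1]


def size(tree):
    return 1 if tree[0] == 'leaf' else 1 + size(tree[2]) + size(tree[3])


def eval_dnf(terms, x):
    return int(any(all(x[v] == s for v, s in t) for t in terms))


# Hypothetical example: x0 x1 OR x2.
example_dnf = [frozenset({(0, 1), (1, 1)}), frozenset({(2, 1)})]
example_tree = dt_min_term(example_dnf)
```

For this example the tree has the same shape as the one a greedy maximal-influence split produces, which is the relationship the second part of the proof of Lemma 2 establishes in general.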

Lemma 3. For any $f(x)$ which can be represented as a read-once DNF, the
decision tree $T$ built by the algorithm DTMinTerm

1. Represents $f(x)$.

2. Has no node such that all inputs arriving at it have the same classification.

3. Has minimal size and height amongst all decision trees representing $f(x)$.

Lemma 4. Let $x_l \in t_i$ and $x_m \in t_j$. Then $|t_i| > |t_j| \leftrightarrow I_f(l) < I_f(m)$.
DTCoeff(s, X, $t_s$)

1: if $\sum_{x_i \in X} |a_i| \le t_s$ or $-\sum_{x_i \in X} |a_i| > t_s$ then
2:   The function is constant, s is a leaf.
3: else
4:   Choose a variable $x_i$ from X having the largest $|a_i|$.
5:   Run DTCoeff($s_1$, $X - \{x_i\}$, $t_s - a_i$) and DTCoeff($s_0$, $X - \{x_i\}$, $t_s + a_i$).
6: end if

Fig. 7. DTCoeff algorithm.

Proof (Lemma 2). It follows from Lemmata 1, 3 and 4 that the trees produced by
the algorithms DTMinTerm and DTInfluence have the same size and height.
Moreover, due to part 3 of Lemma 3, this size and height are minimal amongst
all decision trees representing $f(x)$. □



3.2 Boolean Linear Threshold Functions 

Lemma 5. For any linear threshold function $f_{a,t}(x)$, the decision tree built by
the algorithm DTInfluence has minimal size and minimal height amongst all
decision trees representing $f_{a,t}(x)$.

The proof of Lemma 5 consists of two parts. In the first part of the proof
we introduce the algorithm DTCoeff (see Fig. 7) and prove Lemma 5 for it.
In the second part of the proof we show that the trees built by DTCoeff and
DTInfluence have the same size.

The difference between DTCoeff and DTInfluence is in the choice of splitting
variable. DTCoeff chooses the variable with the largest $|a_i|$ and stops when
the restricted function becomes constant (true or false). The meaning and initial
values of the first two parameters of the algorithm are the same as in DTInfluence,
and the third parameter is initially set to the function's threshold $t$. The
following lemma is proved in the full online version of the paper [11]:

Lemma 6. For any boolean LTF $f_{a,t}(x)$ the decision tree $T$ built by the algorithm
DTCoeff:

1. Represents $f_{a,t}(x)$.

2. Has no node such that all inputs arriving at it have the same classification.

3. Has minimal size and height amongst all decision trees representing $f_{a,t}(x)$.
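A short Python sketch of DTCoeff is given below. It assumes that inputs lie in $\{-1, +1\}^n$ and that $f_{a,t}(x) = 1$ iff $\sum_i a_i x_i > t$; the strictness of the inequality, the tuple encoding of trees and all helper names are assumptions of this sketch. The threshold passed down a branch is the original threshold minus the contribution of the variables already fixed on the path.

```python
from itertools import product


def dt_coeff(a, t, free=None):
    """Tree for the LTF sum(a_i x_i) > t over inputs in {-1, +1}^n.
    Trees are ('leaf', v) or ('node', i, minus_son, plus_son)."""
    free = list(range(len(a))) if free is None else free
    total = sum(abs(a[i]) for i in free)
    if total <= t:
        return ('leaf', 0)      # the remaining sum can never exceed t
    if -total > t:
        return ('leaf', 1)      # the remaining sum always exceeds t
    i = max(free, key=lambda k: abs(a[k]))   # largest |a_i|
    rest = [k for k in free if k != i]
    return ('node', i,
            dt_coeff(a, t + a[i], rest),     # branch x_i = -1
            dt_coeff(a, t - a[i], rest))     # branch x_i = +1


def classify(tree, x):
    while tree[0] == 'node':
        _, i, minus_son, plus_son = tree
        tree = plus_son if x[i] == 1 else minus_son
    return tree[1]


def size(tree):
    return 1 if tree[0] == 'leaf' else 1 + size(tree[2]) + size(tree[3])


def ltf(a, t, x):
    return int(sum(ai * xi for ai, xi in zip(a, x)) > t)


# Hypothetical example: x0 dominates, so a single test suffices.
example_tree = dt_coeff((3, 1, 1), 0)
agrees = all(classify(example_tree, x) == ltf((3, 1, 1), 0, x)
             for x in product((-1, 1), repeat=3))
```

With weights (3, 1, 1) and threshold 0 the function depends only on the first variable, and the sketch stops after one split because both restricted functions are already constant.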

We now prove a sequence of lemmata connecting the influence and the coefficients
of variables in the threshold formula. Let $x_i$ and $x_j$ be two different variables in
$f(x)$. For each of the $2^{n-2}$ possible assignments to the remaining variables we get
a 4 row truth table for different values of $x_i$ and $x_j$. Let $G(i,j)$ be the multiset of
$2^{n-2}$ tables, indexed by the assignment to the other variables. I.e., $G_w$ is the
truth table where the other variables are assigned values $w = w_1, w_2, \ldots, w_{n-2}$.
The structure of a single truth table is shown in Fig. 8. In this figure, and
generally from now on, $v$ and $v'$ are constants in $\{-1, 1\}$. Observe that $I_f(i)$ is
proportional to the sum over the $2^{n-2}$ $G_w$'s in $G(i,j)$ of the number of times
$x_i$    $x_j$    | other variables                  | function value
$v$      $v'$     | $w_1, w_2, w_3, \ldots, w_{n-2}$ | $t_1$
$-v$     $v'$     | $w_1, w_2, w_3, \ldots, w_{n-2}$ | $t_2$
$v$      $-v'$    | $w_1, w_2, w_3, \ldots, w_{n-2}$ | $t_3$
$-v$     $-v'$    | $w_1, w_2, w_3, \ldots, w_{n-2}$ | $t_4$

Fig. 8. Structure of the truth table $G_w$ from $G(i,j)$.



$t_1 \ne t_2$ plus the number of times $t_3 \ne t_4$. Similarly, $I_f(j)$ is proportional to
the sum over the $2^{n-2}$ $G_w$'s in $G(i,j)$ of the number of times $t_1 \ne t_3$ plus the
number of times $t_2 \ne t_4$. We use these observations in the proof of the following
lemma (see online full version [11]):

Lemma 7. If $I_f(i) > I_f(j)$ then $|a_i| > |a_j|$.

Note that if $I_f(i) = I_f(j)$ then there may be any relation between $|a_i|$ and $|a_j|$.
The next lemma shows that choosing the variables with the same influence in
any order does not change the size of the resulting decision tree. For any node
$s$, let $X_s$ be the set of all variables in $X$ which are untested on the path from
the root to $s$. Let $\hat{X}(s) = \{x_1, \ldots, x_k\}$ be the variables having the same non-zero
influence, which in turn is the largest influence amongst the influences of
variables in $X_s$.

Lemma 8. Let $T_i$ ($T_j$) be the smallest decision tree one may get when choosing
any $x_i \in \hat{X}(s)$ ($x_j \in \hat{X}(s)$) at $s$. Let $|T_{opt}|$ be the size of the smallest tree rooted at
$s$. Then $|T_i| = |T_j| = |T_{opt}|$.

Proof. Here we give a brief version of the proof. The full version of the proof
appears in the online version of the paper [11].

The proof is by induction on $k$. For $k = 1$ the lemma trivially holds. Assume
the lemma holds for all $k' < k$. Next we prove the lemma for $k$. Consider two
attributes $x_i$ and $x_j$ from $\hat{X}(s)$ and the possible values of targets in any truth table
$G_w \in G(i,j)$. Since the underlying function is a boolean linear threshold function and
$I_f(i) = I_f(j)$, targets may have 4 forms:

- Type A. All rows in $G_w$ have target value 0.

- Type B. All rows in $G_w$ have target value 1.

- Type C. Target value $f$ in $G_w$ is defined as $f = (a_i x_i > 0 \text{ and } a_j x_j > 0)$.

- Type D. Target value $f$ in $G_w$ is defined as $f = (a_i x_i > 0 \text{ or } a_j x_j > 0)$.

Consider the smallest tree $T$ testing $x_i$ at $s$. There are 3 cases to be considered:

1. Both sons of $x_i$ are leaves. Since $I_f(i) > 0$ and $I_f(j) > 0$ there exists at least
one $G_w \in G(i,j)$ having a target of type C or D. Thus neither $x_i$ nor $x_j$
can determine the function alone, and this case is impossible.

2. Both sons of $x_i$ are non-leaves. By the inductive hypothesis there exist right
and left smallest subtrees of $x_i$, each one rooted with $x_j$. Then $x_i$ and $x_j$
may be interchanged to produce an equivalent decision tree $T'$ testing $x_j$ at
$s$ and having the same size.
DTExactPG(s, X, R, $\phi$)

1: if all examples arriving at s have the same classification then
2:   Set s as a leaf with that value.
3: else
4:   Choose $x_i = \arg\max_{x_i \in X} \{PG(f_R, x_i, \phi)\}$ to be a splitting variable.
5:   Run DTExactPG($s_1$, $X - \{x_i\}$, $R \cup \{x_i = 1\}$, $\phi$).
6:   Run DTExactPG($s_0$, $X - \{x_i\}$, $R \cup \{x_i = 0\}$, $\phi$).
7: end if

Fig. 9. DTExactPG algorithm.

3. Exactly one of the sons of $x_i$ is a leaf.

Let us consider the third case. By the inductive hypothesis the non-leaf son of
$s$ tests $x_j$. It is not hard to see that in this case $G(i,j)$ contains either truth
tables with targets of type A and C or truth tables with targets of type B and
D (otherwise both sons of $x_i$ are non-leaves). In both these cases some value of
$x_j$ determines the value of the function. Therefore if we place the test $x_j = 1$?
at $s$, then exactly one of its sons is a leaf. Thus it can be easily verified that
testing $x_j$ and then $x_i$, or testing $x_i$ and then $x_j$, results in a tree of the same
size (see [11]). □

Proof (Lemma 5). One can prove a result analogous to Lemma 8, dealing with
the height rather than the size. Combining this result with Lemmata 1, 6, 7 and 8
we obtain that the trees built by DTInfluence and DTCoeff have the same
size and height. Moreover, due to part 3 of Lemma 6, this size and height are
minimal amongst all decision trees representing $f_{a,t}(x)$. □

4 Optimality of Exact Purity Gain 

In this section we introduce a new algorithm for building decision trees, named
DTExactPG (see Fig. 9), using exact values of purity gain. The next lemma
follows directly from the definition of the algorithm:

Lemma 9. Let $f(x)$ be any boolean function. Then the decision tree $T$ built by
the algorithm DTExactPG represents $f(x)$ and there exists no inner node such
that all inputs arriving at it have the same classification.



Lemma 10. For any boolean function $f(x)$, uniformly distributed $x$, and any
node $s$, $p(s_0(i))$ and $p(s_1(i))$ are symmetric relative to $p(s)$:
$|p(s_1(i)) - p(s)| = |p(s_0(i)) - p(s)|$ and $p(s_1(i)) \ne p(s_0(i))$.

Proof. See online full version of the paper [11]. □ 



Lemma 11. For any unate boolean function $f(x)$, uniformly distributed input
$x$, and any impurity function $\phi$, $I_f(i) > I_f(j) \leftrightarrow PG(f, x_i, \phi) > PG(f, x_j, \phi)$.
Fig. 10. Comparison of impurity sum of $x_i$ and $x_j$



Proof. Since $x$ is distributed uniformly, it is sufficient to prove $I_f(i) > I_f(j) \leftrightarrow
BIS(f, x_i, \phi) < BIS(f, x_j, \phi)$. Let $d_i$ be the number of pairs of examples differing
only in $x_i$ and having different target values. Since all examples have equal probability,
$I_f(i) = \frac{d_i}{2^{n-1}}$. Consider a split of node $s$ according to $x_i$. All positive
examples arriving at $s$ may be divided into two categories:

1. Flipping the value of the $i$-th attribute does not change the target value of the
example. Then the first half of such positive examples passes to $s_1$ and the
second half passes to $s_0$. Consequently such positive examples contribute
equally to the probabilities of positive examples in $s_1$ and $s_0$.

2. Flipping the value of the $i$-th attribute changes the target value of the example.
Consider such a pair of positive and negative examples, differing only in $x_i$.
Since $f(x)$ is unate, either all positive examples in such pairs have $x_i = 1$ and
all negative examples in such pairs have $x_i = 0$, or vice versa. Consequently
all such positive examples pass either to $s_1$ or to $s_0$. Thus such examples
increase the probability of positive examples in one of the nodes $\{s_1, s_0\}$
and decrease the probability of positive examples in the other.

The number of positive examples in the second category is $d_i$. Thus $I_f(i) >
I_f(j) \leftrightarrow \max\{p(s_1(i)), p(s_0(i))\} > \max\{p(s_1(j)), p(s_0(j))\}$. By Lemma 10, if
$\max\{p(s_1(i)), p(s_0(i))\} > \max\{p(s_1(j)), p(s_0(j))\}$ then the probabilities of $x_i$
are more distant from $p(s)$ than those of $x_j$.

Figure 10 depicts one of the possible variants of reciprocal configuration of
$p(s_1(i))$, $p(s_0(i))$, $p(s_1(j))$, $p(s_0(j))$ and $p(s)$, when all these probabilities are
less than 0.5. Let $\phi(A)$ be the value of the impurity function at point $A$. Then
$BIS(f, x_j, \phi) = \phi(p(s_1(j))) + \phi(p(s_0(j))) = \phi(B) + \phi(C) = \phi(A) + \phi(D) >
\phi(E) + \phi(F) = BIS(f, x_i, \phi)$. Note that the last inequality holds due to the
concavity of the impurity function. The rest of the cases of reciprocal configurations of
probabilities and 0.5 may be proved similarly. □

Proof (Theorem 1). Follows from combining Lemmata 1, 9, 11, 2 and 5. □ 

5 Optimality of Approximate Purity Gain 

The purity gain computed by practical algorithms is not exact. However, under 
some conditions, approximate purity gain suffices. The proof of this result (which 
DTStatQuery(s, X, R, $\phi$, h)

1: if Pr[/fl = > 1 - then 

2: Set s as a positive leaf. 

3: else 

4: if PT[fR = l]{^) < ^ then 

5: Set s as a negative leaf. 

6: else 

7: Choose Xi = argmax^;^ ex{PG{fR,Xi,(j), )} to be a splitting variable. 

8: Run DTStatQuery($s_1$, $X - \{x_i\}$, $R \cup \{x_i = 1\}$, $\phi$, h).

9: Run DTStatQuery($s_0$, $X - \{x_i\}$, $R \cup \{x_i = 0\}$, $\phi$, h).

10: end if 

11: end if 

Fig. 11. DTStatQuery algorithm. $h$ is the minimal height of a decision tree representing
the function.



is essentially Theorem 2) is very technical and is based on the following lemmata 
(proved in the full online version of the paper [11]): 

Lemma 12. Let $f(x)$ be a boolean function which can be represented by a decision
tree of depth $h$. Suppose $x$ is distributed uniformly. Then $\Pr(f(x) = 1) = \frac{r}{2^h}$,
$r \in \mathbb{Z}$, $0 \le r \le 2^h$.
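The reasoning behind this statement can be made concrete: if every path tests each variable at most once, a uniform input reaches a leaf at depth $d$ with probability $2^{-d}$, which is a multiple of $2^{-h}$ whenever $d \le h$. The Python sketch below computes $\Pr(f(x) = 1)$ from a tree with exact rational arithmetic. The tuple encoding of trees and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def height(tree):
    """Maximal path length from the root; trees are ('leaf', v) or
    ('node', i, zero_son, one_son)."""
    if tree[0] == 'leaf':
        return 0
    return 1 + max(height(tree[2]), height(tree[3]))


def positive_probability(tree, depth=0):
    """Pr[f(x) = 1] for uniform x, assuming each variable is tested at most
    once per path, so a leaf at depth d is reached with probability 2^-d."""
    if tree[0] == 'leaf':
        return Fraction(tree[1], 2 ** depth)
    return (positive_probability(tree[2], depth + 1)
            + positive_probability(tree[3], depth + 1))


# Hypothetical tree for x0 x1 OR x2: test x2, then x0, then x1.
example_tree = ('node', 2,
                ('node', 0, ('leaf', 0), ('node', 1, ('leaf', 0), ('leaf', 1))),
                ('leaf', 1))
p = positive_probability(example_tree)
r = p * 2 ** height(example_tree)
```

For the example tree the height is 3 and the probability is 5/8, so $r = 5$ is an integer between 0 and $2^3$.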



Lemma 13. Let $\phi(x)$ be the Gini index, the entropy or the new index impurity
function. If for all $1 \le i \le n$ and all inner nodes $s$, the probabilities $p(s_0(i))$,
$p(s_1(i))$, $\Pr[s_0(i)]$ and $\Pr[s_1(i)]$ are computed within accuracy $\epsilon =$ then

$IS(s, x_i, \phi) > IS(s, x_j, \phi) \leftrightarrow \widehat{IS}(s, x_i, \phi) > \widehat{IS}(s, x_j, \phi)$

Proof (Theorem 2). Sketch. Follows from combining Lemma 13, Theorem 1, 
Hoeffding [14] and Chernoff [8] bounds. See [11] for the complete proof. □ 

6 Noise-Tolerant Probably Correct Learning

In this section we assume that each input example is misclassified with probability
$\eta < 0.5$. Since our noise-free algorithms learn probably correctly, we would
like to obtain the same results of probable correctness with noisy examples. Our
definition of PC learning with noise is that the examples are noisy yet, nonetheless,
we insist upon zero generalization error. Previous learning algorithms with
noise (e.g. [2] and [18]) require a non-zero generalization error.

We introduce the algorithm DTStatQuery, which is a reformulation of the 
algorithm DTApproxPG in terms of statistical queries. [2] and [18] show how to 
simulate statistical queries from examples corrupted by small classification noise. 
We obtain our results by adapting this simulation to the case of PC learning and 
combining it with DTStatQuery algorithm. 
Let $\widehat{\Pr}[f_R = 1](\alpha)$ be the estimation of the probability $\Pr[f_R = 1]$ within
accuracy $\alpha$. Thus $\widehat{\Pr}[f_R = 1](\alpha)$ is also a statistical query [18] of the predicate
$f_R = 1$, computed within accuracy $\alpha$. We refer to the computation of approximate
purity gain, using the calls to statistical queries within accuracy $\alpha$, as
$\widehat{PG}(f_R, x_i, \phi, \alpha)$. The algorithm DTStatQuery is shown in Fig. 11.

Lemma 14. Let $f(x)$ be a read-once DNF or a boolean linear threshold function.
Then, for any impurity function, the decision tree built by the algorithm
DTStatQuery represents $f(x)$ and has minimal size and minimal height amongst
all decision trees representing $f(x)$.

Proof. Sketch. Follows from Lemma 12 and Theorem 2. See [11] for the complete
proof. □

Kearns shows in [18] how to simulate statistical queries from examples corrupted
by small classification noise. ^ This simulation involves the estimation of the
noise rate $\eta$. [18] shows that if statistical queries need to be computed within
accuracy $\alpha$ then $\eta$ should be estimated within accuracy $\Delta/2 = O(\alpha)$. Such an
estimation may be obtained by taking [j^l estimations of y of the form iA, 
z = 0,l,... ,]"^]. Running the learning algorithm using each time different es- 
timation we obtain [ ^] -|- 1 hypotheses ho, h\, . . . , . By the definition of 

A, amongst these hypotheses there exists at least one hypothesis hj having the 
same generalization error as the statistical query algorithm. Then [18] describes 
a procedure how to recognize the hypothesis having generalization error of at 
most e. The naive approach to recognize the minimal sized decision tree hav- 
ing zero generalization error amongst ho,... ,h^^^ is to apply the procedure 
of [18] with e = However in this case this procedure requires about 2" noisy 
examples. Next we show how to recognize minimal size decision tree with zero 
generalization error using only poly(2^) uniformly distributed noisy examples. 

Amongst the $\lceil \frac{1}{2\Delta} \rceil$ estimations $\hat\eta_i = i\Delta$ of $\eta$ there exists $i = j$ such that
$|\eta - j\Delta| \le \Delta/2$. Our current goal is to recognize such $j$.
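As a rough illustration of this grid of candidate noise rates, the sketch below enumerates the candidates and picks the index closest to a given rate. It assumes the candidates are $i\Delta$ for $i = 0, \ldots, \lceil 1/(2\Delta) \rceil$, so that they cover $[0, 1/2]$; the function names are hypothetical. In the actual setting the true rate is unknown, so the index has to be recognized indirectly, as described in the remainder of this section.

```python
import math


def noise_rate_grid(delta):
    """Candidate noise rates i * delta for i = 0 .. ceil(1 / (2 * delta)).

    Assumption: the grid is meant to cover the interval [0, 1/2].
    """
    top = math.ceil(1.0 / (2.0 * delta))
    return [i * delta for i in range(top + 1)]


def closest_index(eta, delta):
    """Index j whose candidate j * delta is nearest to the (here known) rate eta."""
    grid = noise_rate_grid(delta)
    return min(range(len(grid)), key=lambda i: abs(eta - grid[i]))


# Illustrative values only: spacing 0.125 and a true noise rate of 0.3.
grid = noise_rate_grid(0.125)
j = closest_index(0.3, 0.125)
```

Because neighbouring candidates are `delta` apart, the nearest candidate is always within `delta / 2` of any rate in the covered interval.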

Let $\gamma_i = \Pr[h_i(x) \neq f(x)]$ be the generalization error of $h_i$ over the
space of uniformly distributed noisy examples. Clearly, $\gamma_i \ge \eta$ for all $i$, and
$\gamma_j = \eta$. Let $\hat\gamma_i$ be the estimation of $\gamma_i$ within accuracy $\Delta/4$. Then
$|\hat\gamma_j - j\Delta| < 3\Delta/4$.

Let $H = \{ i : |\hat\gamma_i - i\Delta| < 3\Delta/4 \}$. Clearly $j \in H$. Therefore if $|H| = 1$ then $H$
contains only $j$. Consider the case of $|H| > 1$. Since $\gamma_i \ge \eta$ for all $i$, if $i \in H$
then $i \ge j - 1$. Therefore one of the two minimal values in $H$ is $j$. Let $i_1$ and
$i_2$ be the two minimal values in $H$. If $h_{i_1}$ and $h_{i_2}$ are the same tree then clearly
they are the one with smallest size representing the function. If $|i_1 - i_2| > 1$
then, using the argument $i \in H \Rightarrow i \ge j - 1$, we get that $j = \min\{i_1, i_2\}$. If
$|i_1 - i_2| = 1$ and $|\hat\gamma_{i_1} - \hat\gamma_{i_2}| \ge \Delta/2$ then, since the accuracy of $\hat\gamma$ is $\Delta/4$,
$j = \min\{i_1, i_2\}$. The final subcase to be considered is $|\hat\gamma_{i_1} - \hat\gamma_{i_2}| < \Delta/2$ and
$|i_1 - i_2| = 1$. In this case $\hat\eta = (\hat\gamma_{i_1} + \hat\gamma_{i_2})/2$ estimates the true value of $\eta$
within accuracy $\Delta/2$. Thus running the learning algorithm with the value $\hat\eta$
for the noise rate produces the same tree as the one produced by the statistical
query algorithm.

* Aslam and Decatur show in [2] a more efficient procedure for simulation of
statistical queries from noisy examples. However, their procedure needs the same
adaptation to the PC learning model as that of Kearns [18]. For the sake of
simplicity we show here the adaptation to the PC model with the procedure of
[18]. The procedure of [2] may be adapted to the PC model in a similar manner.

It can be shown that to simulate statistical queries and recognize a hypothesis
with zero generalization error, all estimations should be done within accuracy
$poly(1/2^h)$. Thus the sample complexity of the algorithm DTNoiseTolerant is
the same as that of DTApproxPG. Consequently, Theorem 3 follows.

7 Directions for Future Research 

Based on our results, the following open problems may be attacked:

1. Extension to other types of functions. In particular we conjecture that
for all unate functions the algorithm DTExactPG builds minimal depth
decision trees and the size of the resulting trees is not far from minimal.

2. Extension to other distributions. It may be verified that all our results
hold for constant product distributions ($\forall i\ \Pr[x_i = 1] = c$) and do not
hold for general product distributions ($\forall i\ \Pr[x_i = 1] = c_i$).

3. Small number of examples. The really interesting case is when the number
of examples is less than $poly(2^h)$. In this case nothing is known about the
size and the depth of the resulting decision tree.

Moreover we would like to compare our noise-tolerant version of impurity-based 
algorithms vs. pruning methods. Finally, since influence and impurity gain are 
logically equivalent, it would be interesting to use the notion of purity gain in 
the field of analysis of boolean functions. 
Abstract. In this work we consider the task of pattern recognition under
the assumption that examples are conditionally independent. Pattern
recognition is predicting a sequence of labels based on objects given for
each label and on examples (pairs of objects and labels) learned so far.
Traditionally, this task is considered under the assumption that examples
are independent and identically distributed (i.i.d.). We show that some
classical nonparametric predictors, originally developed to work under
the (strict) i.i.d. assumption, retain their consistency under a weaker
assumption of conditional independence. By conditional independence we
mean that objects are distributed identically and independently given
their labels, while the only condition on the distribution of labels is that
the rate of occurrence of each label does not tend to zero. The predictors
we consider are partitioning and nearest neighbour estimates.



1 Introduction 

Pattern recognition (or classification) is, informally, the following task. There 
is a finite number of classes of some complex objects. A predictor is learning 
to label objects according to the class they belong to (i.e. to classify), based 
only on some examples (labeled objects). One of the typical practical examples 
is recognition of a hand-written text. In this case, an object is a hand-written 
letter and a label is the letter of an alphabet it denotes. Another example is 
recognising some illness in a patient. An object here is the set of symptoms of a 
patient, and the classes are those of normal and ill. 

The formal model of the task used most widely is described, for example,
in [10], and can be briefly introduced as follows. The objects $x \in X$ are drawn
independently and identically distributed (i.i.d.) according to some unknown
(but fixed) probability distribution $P$ on $X$. The labels $y \in Y$ are given for each
object according to some (also unknown but fixed) function $\eta(x)$.* The space
$Y$ of labels is assumed to be finite (often binary). The task is to construct the
best predictor for the labels, based on the observed data, i.e. actually to “learn”
$\eta(x)$.

* Often (e.g. in [10]) a more general situation is considered: the labels are drawn
according to some probability distribution $P(y|x)$, i.e. each object can have more
than one possible label.

This theoretical model fits many practical applications; however, the i.i.d.
assumption is often violated in practice, although the object-label dependence is
preserved, and so the pattern recognition methods seem to be applicable.
Consider the following situation. Suppose we are trying to recognise a hand-written
text. Obviously, letters in the text are dependent (for example, we strongly
expect to meet “u” after “q”). Does it mean that we cannot use pattern recognition
methods developed within the i.i.d. model for the text recognition task? No, we
can shuffle the letters of the text and then use those methods. But will the results
of recognition change significantly if we do not shuffle the letters? It is intuitively
clear that the answer is negative, at least for nearly any popular pattern
recognition method. Moreover, in online tasks it is often not possible to permute
the examples, and so the question is not idle.

In [9] another model for pattern recognition was proposed, which aims to 
overcome this obstacle. In this model, the i.i.d. assumption is replaced by the 
following. 

The labels $y \in Y$ are drawn according to some unknown (but fixed)
distribution $P(y)$, where $P$ is a distribution over the set of all infinite sequences of
labels. There can be any type of dependence between labels; moreover, we can
assume that we are dealing with any (fixed) combinatorial sequence of labels.
However, in this sequence the rate of occurrence of each label should keep above
some positive threshold. For each label the corresponding object $x \in X$ is
generated according to some (unknown but fixed) probability distribution $P(x|y)$.
The (deterministic) relation $y = \eta(x)$ for some function $\eta$ is still required to
hold. This model for pattern recognition is called the conditional (i.i.d.) model.

In [9] some methods are developed to estimate the performance of pattern 
recognition methods in the conditional i.i.d. model. 

In this paper we adapt these methods to the task of studying weak (universal) 
consistency, and prove that certain methods which are known to be consistent 
in the i.i.d. model, are consistent in the conditional model. More precisely, a 
predictor is called (universally) consistent if the probability of its error tends to 
zero for any i.i.d. distribution. In particular, partitioning and nearest neighbour 
estimates are universally consistent. We show that they are consistent in the 
conditional model, i.e. the probability of error tends to zero for any distribution 
satisfying the assumptions of the conditional model. 

Various approaches to relaxing the i.i.d. assumption in learning tasks have been
proposed in the literature. Thus, in [6], [7] the authors study the nearest
neighbour and kernel estimators for the task of regression estimation with continuous
regression function, under the assumption that labels are conditionally
independent of data given their objects. In [5] a generalisation of the PAC
approach to Markov chains with finite or countable state space is presented. In [1]
and [8] the authors study the task of predicting stationary and ergodic sequences,
and construct a weakly consistent predictor.

The approach proposed in [9] and developed in the present paper differs
from those mentioned above in that we do not construct a predictor for some
particular model, but rather develop a framework to extend results about any
predictor from the classical model to the proposed one. We then apply the results
to different well-studied predictors.



2 Main Results 

Consider a sequence of examples $(x_1, y_1), (x_2, y_2), \ldots$; each example $z_i := (x_i, y_i)$
consists of an object $x_i \in X$ and a label $y_i := \eta(x_i) \in Y$, where $X = \mathbb{R}^d$ for some
$d \in \mathbb{N}$ is called an object space, $Y := \{0, 1\}$ is called a label space and $\eta : X \to Y$
is some deterministic function. For simplicity we made the assumption that the
space $Y$ is binary, but all results easily extend to the case of any finite space $Y$.
The notation $Z := X \times Y$ is used for the measurable space of examples. Objects
are drawn according to some probability distribution $\mathbf{P}$ on $X^\infty$ (and labels are
defined by $\eta$).

The notation $\mathbf{P}$ is used for distributions on $X^\infty$ while the symbol $P$ is reserved
for distributions on $X$. In the latter case $P^\infty$ denotes the i.i.d. distribution on
$X^\infty$ generated by $P$.

Examples drawn according to a distribution $\mathbf{P}$ are called conditionally i.i.d. if
$y_n = \eta(x_n)$ for some function $\eta$ on $X$ and each $n$, and if $\mathbf{P}$ satisfies the following
two conditions.

First, for any $n \in \mathbb{N}$

$$\mathbf{P}(x_n \in A \mid z_1, \ldots, z_{n-1}) = \mathbf{P}(x_n \in A \mid y_n), \tag{1}$$

where $A$ is any measurable set (an event) in $X$. In other words, $x_n$ is conditionally
independent of $x_1, \ldots, x_{n-1}$ given $y_n$ (on conditional independence see [2]).

Second, for any $y \in Y$, for any $n_1, n_2 \in \mathbb{N}$ and for any event $A$ in $X$

$$\mathbf{P}(x_{n_1} \in A \mid y_{n_1} = y) = \mathbf{P}(x_{n_2} \in A \mid y_{n_2} = y). \tag{2}$$

Less formally, these conditions can be reformulated as follows. Assume that
we have some sequence $(y_n)_{n \in \mathbb{N}}$ of labels and two probability distributions $P_0$
and $P_1$ on $X$. Each example $x_n \in X$ is drawn according to the distribution $P_{y_n}$;
examples are drawn independently of each other.
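The informal description above translates directly into a sampling procedure. The following is only a sketch under stated assumptions: the two class-conditional distributions are supplied as hypothetical sampler callables, the label sequence is fixed in advance, and the uniform distributions used in the example are chosen purely for illustration.

```python
import random


def draw_conditionally_iid(labels, sample_p0, sample_p1, rng):
    """Draw one object per label: an object for label y comes from P_y only.

    labels     -- any fixed sequence of 0/1 labels (may be arbitrarily dependent)
    sample_p0  -- callable rng -> object, a sampler for P_0 (hypothetical)
    sample_p1  -- callable rng -> object, a sampler for P_1 (hypothetical)
    """
    examples = []
    for y in labels:
        x = sample_p1(rng) if y == 1 else sample_p0(rng)
        examples.append((x, y))
    return examples


rng = random.Random(0)
labels = [1, 0, 0, 1, 0, 0, 0, 1]  # a deterministic, clearly non-i.i.d. pattern
examples = draw_conditionally_iid(
    labels,
    lambda r: r.uniform(0.0, 0.5),  # illustrative P_0
    lambda r: r.uniform(0.5, 1.0),  # illustrative P_1
    rng,
)
```

Each object depends on the past only through its own label, which is the content of the two conditions.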

Apart from the above two conditions we will also need a condition on the
frequencies of labels. For any sequence of examples $z_1, z_2, \ldots$ denote
$p(n) = \frac{1}{n}\#\{i \le n : y_i = 1\}$. We say that the rates of occurrence of the labels are
bounded from below if there exists $\delta$, $0 < \delta < 1/2$, such that

$$\lim_{n \to \infty} \mathbf{P}(p(n) \in [\delta, 1 - \delta]) = 1. \tag{3}$$
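For a concrete label sequence the frequency $p(n)$ and the band condition are easy to check. The helper names below are hypothetical and the snippet only evaluates the event inside the limit for a single $n$; it does not verify the limit itself.

```python
def label_rate(labels, n):
    """p(n): the fraction of ones among the first n labels."""
    return sum(1 for y in labels[:n] if y == 1) / n


def rate_in_band(labels, n, delta):
    """True if p(n) lies in [delta, 1 - delta]."""
    return delta <= label_rate(labels, n) <= 1 - delta


balanced = [1, 0, 1, 0]
rare_ones = [0] * 9 + [1]
```

With `delta = 0.2`, the sequence `balanced` stays inside the band at `n = 4`, while `rare_ones` at `n = 10` has rate 0.1 and falls outside it.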

A predictor is a measurable function $\Gamma(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x)$ taking
values in $Y$. Denote $\Gamma_n := \Gamma(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x)$.

The probability of an error of a predictor $\Gamma$ on each step $n$ is defined as

$$\mathrm{err}_n(\Gamma, \mathbf{P}, z_1, \ldots, z_{n-1}) := \mathbf{P}\{(x, y) \in Z \mid y \ne \Gamma_n(z_1, \ldots, z_{n-1}, x)\}.$$

(Here $\mathbf{P}$ in the list of arguments of $\mathrm{err}_n$ is understood as a distribution
conditional on $z_1, \ldots, z_{n-1}$.) We will often use a shorter notation $\mathrm{err}_n(\Gamma, \mathbf{P})$ in place
of $\mathrm{err}_n(\Gamma, \mathbf{P}, z_1, \ldots, z_{n-1})$.

In this paper we will consider two types of predictors: partitioning and nearest
neighbour classifiers.

The nearest neighbour predictor assigns to a new object $x$ the label of its
nearest neighbour among $x_1, \ldots, x_{n-1}$:

$$\Gamma_n(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x) := y_j,$$

where $j = \arg\min_{i = 1, \ldots, n-1} \|x - x_i\|$.

For i.i.d. distributions this predictor is consistent, i.e. $E(\mathrm{err}_n(\Gamma, P^\infty)) \to 0$
for any distribution $P$ on $X$, see [3].
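The nearest neighbour rule can be sketched in a few lines. This is a minimal illustration, assuming objects are tuples of floats compared by Euclidean distance; the function name is hypothetical, and ties are broken by taking the earliest training example, a choice the text does not specify.

```python
import math


def nearest_neighbour_predict(train, x):
    """Return the label of the training object closest to x.

    train -- non-empty list of (object, label) pairs, objects are float tuples
    x     -- the new object to classify
    """
    # min() keeps the first minimiser, so ties go to the earliest example.
    _, label = min(train, key=lambda z: math.dist(z[0], x))
    return label


train = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.2, 0.1), 0)]
```

For example, `nearest_neighbour_predict(train, (0.9, 0.8))` picks the example at `(1.0, 1.0)` and therefore returns label 1.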

We generalise this result as follows.

Theorem 1. Let $\Gamma$ be the nearest neighbour classifier. Let $\mathbf{P}$ be some
distribution on $X^\infty$ satisfying (1), (2) and (3). Then

$$E(\mathrm{err}_n(\Gamma, \mathbf{P})) \to 0.$$



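A partitioning predictor on each step $n$ partitions the object space $X = \mathbb{R}^d$,
$d \in \mathbb{N}$, into disjoint cells $A^n_1, A^n_2, \ldots$ and classifies in each cell according to the
majority vote:

$$\Gamma_n(z_1, \ldots, z_{n-1}, x) := \begin{cases} 0 & \text{if } \sum_{i=1}^{n-1} I_{y_i = 1} I_{x_i \in A(x)} < \sum_{i=1}^{n-1} I_{y_i = 0} I_{x_i \in A(x)}, \\ 1 & \text{otherwise,} \end{cases}$$

where $A(x)$ denotes the cell containing $x$. Denote $\mathrm{diam}(A) = \sup_{x, y \in A} \|x - y\|$
and $N(x) = \sum_{i=1}^{n-1} I_{x_i \in A(x)}$.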
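A small sketch may help fix ideas about the majority vote within a cell. It assumes one particular partition, a regular grid of cubic cells with a fixed side length, which is only one of many admissible choices; the function names are hypothetical. Following the strict inequality in the rule as printed, ties (including empty cells) are resolved to label 1 here.

```python
import math
from collections import Counter


def cell_of(x, side):
    """Index of the cubic grid cell of the given side length that contains x."""
    return tuple(math.floor(c / side) for c in x)


def partitioning_predict(train, x, side):
    """Majority vote among the training examples that share a cell with x."""
    target = cell_of(x, side)
    votes = Counter(y for obj, y in train if cell_of(obj, side) == target)
    # Label 0 only when zeros strictly outnumber ones; otherwise label 1.
    return 0 if votes[1] < votes[0] else 1


train = [((0.1, 0.1), 1), ((0.2, 0.3), 1), ((0.4, 0.1), 0), ((0.9, 0.9), 0)]
```

In a consistent scheme the side length would shrink with the sample size slowly enough that the number of points per cell still grows, which is what the conditions on the cell diameter and on $N$ express.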

It is a well known result (see, e.g. [4]) that a partitioning predictor is weakly
consistent, provided certain regularity conditions on the size of cells hold. More
precisely, let $\Gamma$ be a partitioning predictor such that $\mathrm{diam}(A(X)) \to 0$ in probability
and $N(X) \to \infty$ in probability. Then for any distribution $P$ on $X$

$$E(\mathrm{err}_n(\Gamma, P^\infty)) \to 0.$$

We generalise this result to the case of conditionally i.i.d. examples as follows.

Theorem 2. Let $\Gamma$ be a partitioning predictor such that $\mathrm{diam}(A(X)) \to 0$ in
probability and $N(X) \to \infty$ in probability, for any distribution generating i.i.d.
examples. Then

$$E(\mathrm{err}_n(\Gamma, \mathbf{P})) \to 0$$

for any distribution $\mathbf{P}$ on $X^\infty$ satisfying (1), (2) and (3).

The proof of the two theorems above is based on the results of paper [9],
which allow one to construct estimates of the probability of an error in the
conditionally i.i.d. model based on the estimates in the i.i.d. model.

The details can be found in Section 4.

3 Discussion

In this work we have generalised certain consistency results to the model where
the i.i.d. assumption is replaced by weaker conditions. It can be questioned
whether these new assumptions are necessary, i.e. can our consistency results
be generalised further in the same direction. In particular, the assumption (3)
might appear redundant: if the rate of occurrence of some label tends to zero,
can we just ignore this label without affecting the asymptotics?

It appears that this is not the case, as the following example illustrates. Let
$X = [0, 1]$, let $\eta(x) = 0$ if $x$ is rational and $\eta(x) = 1$ otherwise. The
distribution $P_1$ is uniform on the set of irrational numbers, while $P_0$ is any distribution
such that $P_0(x) \ne 0$ for any rational $x$. The nearest neighbour predictor is
consistent for any i.i.d. distribution which agrees with the definition, i.e. for any
$p = P(y = 1) \in [0, 1]$. (This construction is due to T. Cover.) Next we construct
a distribution $\mathbf{P}$ on $X^\infty$ which satisfies assumptions (1) and (2) but for which
the nearest neighbour predictor $\Gamma$ is not consistent.

Fix some $\delta$, $0 < \delta < 1$. Assume that according to $\mathbf{P}$ the first label is always
1 (i.e. $\mathbf{P}(y_1 = 1) = 1$; the object is an irrational number). The next $k_1$ labels are
always 0 (rationals), then follows 1, then $k_2$ zeros, and so on. It is easy to check
that there exists a sequence $k_1, k_2, \ldots$ such that with probability at least $\delta$, for
each $n$ and for each irrational $x \in \{x_1, \ldots, x_n\}$, $x$ is between two rationals and

$$\min_{i, j \le n,\ x \text{ is between rational } x_i \text{ and } x_j} |x_i - x_j| \le \frac{1 - \delta}{m(n)},$$

where $m(n)$ is the total number of rational objects up to the trial $n$. On each
step $n$ such that $n = t + \sum_{j=1}^{t-1} k_j$ for some $t \in \mathbb{N}$ (i.e. on each irrational object)
we have

$$E(\mathrm{err}_n(\Gamma, \mathbf{P})) \ge \delta\Big(1 - \sum_{j < n,\ x_j \text{ is irrational}} P(x_j \text{ is the nearest neighbour of } x)\Big) = \delta^2.$$

As irrational objects are generated infinitely often (that is, with intervals $k_i$),
the probability of error does not tend to zero.

4 Proofs 

The tolerance to data $\Delta(\Gamma, P)$ of a predictor $\Gamma$ is defined as follows:

$$\Delta(\Gamma, P, z_1, \ldots, z_{n-1}) := \max_{j \le \kappa_n;\ \pi : \{1, \ldots, n\} \to \{1, \ldots, n\}} \big| \mathrm{err}_{n+1}(\Gamma, P^\infty, z_1, \ldots, z_n) - \mathrm{err}_{n-j+1}(\Gamma, P^\infty, z_{\pi(1)}, \ldots, z_{\pi(n-j)}) \big|,$$

where $\kappa_n = \sqrt{n \log n}$. Tolerance to data reflects the response of a predictor to
permutations and small changes in the training data: if a small portion of the
training data has been removed (and possibly the training set was permuted)
then the probability of error should not change significantly.

We will also use another version of tolerance to data:

$$\bar\Delta(\Gamma, P, z_1, \ldots, z_{n-1}) := \max_{j \le \kappa_n;\ \{i_1, \ldots, i_j\} \subset \{1, \ldots, n\};\ z'_i} \big| \mathrm{err}_{n+1}(\Gamma, P^\infty, z_1, \ldots, z_n) - \mathrm{err}_{n+1}(\Gamma, P^\infty, \zeta_1, \ldots, \zeta_n) \big|,$$

where $\zeta_i = z'_i$ if $i \in \{i_1, \ldots, i_j\}$ and $\zeta_i = z_i$ otherwise; the maximum is taken
over all $z'_i$, $n - j < i \le n$, consistent with $\eta$.

It means that instead of removing (at most) $\kappa_n$ examples from the training
data $(z_1, \ldots, z_n)$ we replace them with an arbitrary sample.

Let $\mathbf{P}$ be some distribution on $X^\infty$ satisfying (1) and (2). We say that a
distribution $P$ on $X$ agrees with $\mathbf{P}$ if the conditional distribution $P(x \mid y)$ is equal
to $P_y$ and $P(y) \ne 0$ for each $y \in Y$. Clearly, this defines the distribution $P$ up to
the parameter $p = P(y = 1) \in (0, 1)$. For a distribution $\mathbf{P}$ on $X^\infty$ we denote the
family of distributions which agree with $\mathbf{P}$ by $(P_p)_{p \in (0,1)}$, where $P_p(y = 1) = p$
for each $p$ in $(0, 1)$.

The following fact is proved in [9]. 

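Theorem 3 ([9]). Suppose that a distribution $\mathbf{P}$ on $X^\infty$ satisfies (1) and (2).
Fix some $\delta \in (0, 1/2]$, denote $p(n) := \frac{1}{n}\#\{i \le n : y_i = 0\}$ and
$C_n := \mathbf{P}(\delta \le p(n) \le 1 - \delta)$ for any $n \in \mathbb{N}$. For any predictor $\Gamma$

$$\mathbf{P}(\mathrm{err}_n(\Gamma, \mathbf{P}) > \varepsilon) \le 2\Big( \sup_{p \in [\delta, 1-\delta]} P_p(\mathrm{err}_n(\Gamma, P_p) > \delta\varepsilon/3) + 2 \sup_{p \in [\delta, 1-\delta]} P_p(\Delta_{n + \kappa_n/2}(\Gamma, P_p) > \delta\varepsilon/6) \Big) + (1 - C_n)$$

for any $\varepsilon > 0$ and all sufficiently large $n$ (the explicit bound on $n$ is given in [9]).

It is easy to show after [9] that the theorem also holds if $\Delta$ is replaced by $\bar\Delta$.
The condition (3) implies $(1 - C_n) \to 0$.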

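From these facts and Theorem 3 we can deduce the following supplementary
result, which will help us to derive Theorems 1 and 2.

Lemma 1. Let $\Gamma$ be a predictor such that

$$\sup_{p \in [\delta, 1-\delta]} E(\mathrm{err}_n(\Gamma, P_p)) \to 0 \tag{4}$$

and either

$$\sup_{p \in [\delta, 1-\delta]} E(\Delta_n(\Gamma, P_p)) \to 0 \tag{5}$$

or

$$\sup_{p \in [\delta, 1-\delta]} E(\bar\Delta_n(\Gamma, P_p)) \to 0. \tag{6}$$

Then for any distribution $\mathbf{P}$ satisfying (1), (2), and (3) we have

$$E(\mathrm{err}_n(\Gamma, \mathbf{P})) \to 0.$$

Observe that both nearest neighbour and partitioning predictors are
symmetric, i.e. do not depend on the order of $z_1, \ldots, z_n$. Moreover,

$$\begin{aligned}
\Delta(\Gamma, P, z_1, \ldots, z_{n-1})
&\le \max_{k \ge n - \kappa_n;\ \{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}} \big| \mathrm{err}_n(\Gamma, z_1, \ldots, z_n) - \mathrm{err}_k(\Gamma, z_{i_1}, \ldots, z_{i_k}) \big| \\
&\le \max_{k \ge n - \kappa_n;\ \{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}} \big| P\{x, y : y = \Gamma(z_1, \ldots, z_n, x),\ y \ne \Gamma(z_{i_1}, \ldots, z_{i_k}, x)\} \\
&\qquad\qquad - P\{x, y : y \ne \Gamma(z_1, \ldots, z_n, x),\ y = \Gamma(z_{i_1}, \ldots, z_{i_k}, x)\} \big| \\
&\le 2 \max_{k \ge n - \kappa_n;\ \{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}} P\{x : \Gamma(z_1, \ldots, z_n, x) \ne \Gamma(z_{i_1}, \ldots, z_{i_k}, x)\}
\end{aligned} \tag{7}$$

and

$$\bar\Delta(\Gamma, P, z_1, \ldots, z_{n-1}) \le 2 \max_{j \le \kappa_n;\ \pi;\ z'_{n-j+1}, \ldots, z'_n} P\{x : \Gamma(z_1, \ldots, z_n, x) \ne \Gamma(z_{\pi(1)}, \ldots, z_{\pi(n-j)}, z'_{n-j+1}, \ldots, z'_n, x)\}. \tag{8}$$

Thus to obtain estimates on $\Delta$ ($\bar\Delta$) we need to estimate the expectation of the
probability of a maximal area on which the decision of $\Gamma$ changes if a portion of
training examples is removed (replaced with an arbitrary sample).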

Furthermore, we will show that for both predictors in question (6) and

$$E(\mathrm{err}_n(\Gamma, P^\infty)) \to 0 \tag{9}$$

for any distribution $P$ on $X$ (which we already have) imply (4). Then it will
only remain to prove (6) for each predictor and apply Lemma 1.

We define the conditional probabilities of error of $\Gamma$ as follows:

$$\mathrm{err}^0_n(\Gamma) := \mathbf{P}(y_n \ne \Gamma_n \mid y_n = 0;\ x_1, y_1, \ldots, x_{n-1}, y_{n-1}),$$

$$\mathrm{err}^1_n(\Gamma) := \mathbf{P}(y_n \ne \Gamma_n \mid y_n = 1;\ x_1, y_1, \ldots, x_{n-1}, y_{n-1})$$

(with the same notational convention as used with the definition of $\mathrm{err}_n(\Gamma)$).

In words, for each $y \in Y = \{0, 1\}$ we define $\mathrm{err}^y_n$ as the probability of
all $x \in X$ such that $\Gamma$ makes an error on the $n$'th trial, given that $y_n = y$ and
given (random variables) $x_1, y_1, \ldots, x_{n-1}, y_{n-1}$. We will also use more explicit
notations for $\mathrm{err}^y_n(\Gamma)$ specifying the distribution or the input sequence of labels,
when the context requires.

Obviously, $\mathrm{err}_n(\Gamma) \le \mathrm{err}^0_n(\Gamma) + \mathrm{err}^1_n(\Gamma)$. Thus, we only need to show that
(4) holds for, say, $\mathrm{err}^1_n(\Gamma)$, i.e. that

$$\sup_{p \in [\delta, 1-\delta]} E(\mathrm{err}^1_n(\Gamma, P_p)) \to 0. \tag{10}$$

Observe that for each of the predictors in question the probability of error
given that the true label is 1 will not decrease if an arbitrary (possibly large)
portion of training examples labeled with ones is changed to an (arbitrary, but
correct) portion of the same size of examples labeled with zeros. Thus, for any
$n$ and any $p \in [\delta, 1 - \delta]$ we can decrease the number of ones in our sample (by
replacing the corresponding examples with examples from the other class) down
to (say) $\delta/2$, not decreasing the probability of error on examples labeled with 1.
So,

$$E(\mathrm{err}^1_n(\Gamma, P_p)) \le E(\mathrm{err}^1_n(\Gamma, P_{\delta/2}) \mid p_n = n(\delta/2)) + P_p(p_n \le n(\delta/2)),$$

where $p_n = \#\{i \le n : y_i = 1\}$. Obviously, the last term (quickly) tends to zero.
Moreover, it is easy to see that

$$E(\mathrm{err}^1_n(\Gamma, P_{\delta/2}) \mid p_n = n(\delta/2)) \le E(\mathrm{err}^1_n(\Gamma, P_{\delta/2}) \mid n(\delta/2) - \kappa_n/2 \le p_n \le n(\delta/2) + \kappa_n/2) + \Delta_n(\Gamma, P_{\delta/2}),$$

and the conditional expectation on the right is bounded by a constant multiple of
$E(\mathrm{err}_n(\Gamma, P_{\delta/2}))$, so the whole bound tends to zero if $\Delta_n(\Gamma, P) \to 0$ (for any $P$).
Thus, it remains to obtain (6). This we will establish for each of the predictors
separately.

Nearest Neighbour predictor. We will derive estimates on $\Delta$ which do not
depend on $p = P(y = 1)$ and tend to zero.

Fix some distribution $P$ on $X$ and some $\varepsilon > 0$. Fix also some $n \in \mathbb{N}$ and
denote

$$B_n(x) := P\{t \in X : t \text{ and } x \text{ have the same nearest neighbour among } x_1, \ldots, x_n\}$$

and $B_n := E(B_n(x))$ (we leave $x_1, \ldots, x_n$ implicit in these notations). Note
that $E(B_n) = 1/n$, where the expectation is taken over $x_1, \ldots, x_n$. Denote
$\mathcal{B} = \{(x_1, \ldots, x_n) \in X^n : B_n \le 1/(n\varepsilon)\}$ and $\mathcal{A}(x_1, \ldots, x_n) = \{x : B_n(x) \le 1/(n\varepsilon^2)\}$.
Applying Markov's inequality twice, we obtain

$$\begin{aligned}
E(\Delta_n(\Gamma, P)) &\le E(\Delta_n(\Gamma, P) \mid (x_1, \ldots, x_n) \in \mathcal{B}) + \varepsilon \\
&\le 2E\Big( \max_{k \ge n - \kappa_n;\ \{i_1, \ldots, i_k\} \subset \{1, \ldots, n\}} P\{x : \Gamma(z_1, \ldots, z_n, x) \ne \Gamma(z_{i_1}, \ldots, z_{i_k}, x) \mid x \in \mathcal{A}\} \,\Big|\, (x_1, \ldots, x_n) \in \mathcal{B} \Big) + 2\varepsilon.
\end{aligned} \tag{11}$$

Observe that the same estimate holds true if $P$ is replaced by either $P_1$ or $P_0$, the
class-conditional versions of $P$. Thus (possibly increasing $n$) we can obtain (11)
for all $P_p$, $p \in [\delta, 1 - \delta]$, simultaneously. Indeed, denoting for each $y \in \{0, 1\}$

$$B^y_n(x) := P\{t \in \eta^{-1}(y) : t \text{ and } x \text{ have the same nearest neighbour among } x_1, \ldots, x_n\}$$

we have

$$E(B_n) \le E(B^1_n) + E(B^0_n) \le E_{P_1}(B^1_{n\delta/2}) + E_{P_0}(B^0_{n\delta/2}) + P(p_n \notin [\delta/2, 1 - \delta/2]),$$

which is bounded uniformly in $p$.

Note that removing one point $x_i$ from a sample $x_1, \ldots, x_n$ we can only change
the value of $\Gamma$ in the area

$$\{x \in X : x_i \text{ is the nearest neighbour of } x\} = B_n(x_i),$$

while adding one point $x_0$ to the sample we can change the decision of $\Gamma$ in the
area

$$D_n(x_0) := \{x \in X : x_0 \text{ is the nearest neighbour of } x\}.$$

It can be shown that the number of examples (among $x_1, \ldots, x_n$) for which a
point $x_0$ is the nearest neighbour is not greater than a constant $\gamma$ which depends
only on the space $X$ (see [4], Corollary 11.1). Thus, $D_n(x_0) \subset \bigcup_{i = j_1, \ldots, j_\gamma} B_n(x_i)$ for
some $j_1, \ldots, j_\gamma$, and so

$$E(\Delta_n(\Gamma, P)) \le 2\varepsilon + 2(\gamma + 1)\kappa_n E\Big( \max_{x \in \mathcal{A}} B_n(x) \,\Big|\, (x_1, \ldots, x_n) \in \mathcal{B} \Big) \le 2\kappa_n \frac{\gamma + 1}{n\varepsilon^2} + 2\varepsilon,$$

which, increasing $n$, can be made less than $3\varepsilon$. □

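Partitioning predictor. For any measurable sets $\mathcal{B} \subset X^n$ and $\mathcal{A} \subset X$ denote

$$D(\mathcal{B}, \mathcal{A}) := E\Big( \max_{j \le \kappa_n;\ \pi;\ z'_{n-j+1}, \ldots, z'_n} P\{x : \Gamma(z_1, \ldots, z_n, x) \ne \Gamma(z_{\pi(1)}, \ldots, z_{\pi(n-j)}, z'_{n-j+1}, \ldots, z'_n, x),\ x \in \mathcal{A}\} \,\Big|\, (x_1, \ldots, x_n) \in \mathcal{B} \Big)$$

and $D := D(X^n, X)$.

Fix some distribution $P$ on $X$ and some $\varepsilon > 0$. From the consistency results
for the i.i.d. model (see, e.g. [4], Chapter 5) we know that $E|\eta_n(x) - \eta(x)| \to 0$. It
can also be inferred from the proof (see [4], Theorem 6.1) that the convergence
is uniform in $p = P(y = 1) \in [\delta, 1 - \delta]$. Hence, $E|\eta_n(x) - \eta(x)| \le \varepsilon^4$ from
some $n$ on for all $p \in [\delta, 1 - \delta]$. Fix any such $n$ and denote
$\mathcal{B} = \{(x_1, \ldots, x_n) : E|\eta_n(x) - \eta(x)| \le \varepsilon^2\}$. By Markov's inequality we obtain
$P(\mathcal{B}) \ge 1 - \varepsilon^2$. For any $(x_1, \ldots, x_n) \in \mathcal{B}$ denote by $\mathcal{A}(x_1, \ldots, x_n)$ the union of all
cells $A^n_i$ for which $E(|\eta_n - \eta| \mid x \in A^n_i) \le \varepsilon$. Clearly, with $x_1, \ldots, x_n$ fixed,
$P(x \in \mathcal{A}(x_1, \ldots, x_n)) \ge 1 - \varepsilon$. Moreover, $D \le D(\mathcal{B}, \mathcal{A}) + \varepsilon + \varepsilon^2$.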

Since $\eta(x)$ is always either 0 or 1, to change a decision in any cell $A \subset \mathcal{A}$ we
need to add or remove at least $(1 - \varepsilon)N(A)$ examples, where $N(A) = N(x)$ for
any x G A. Denote N{k) = E{N{x)) and A{k) = E{P{A{x))). Clearly, = 1 

for any n, as E^^-^ = A{n). 

Thus, ^ N{ny = As before, using Markov inequality and shrinking A if 

necessary we can have P C — ^1^ G A) = 1, P C ^ A) = 1, 

and D < D{B, M) + 3£ + s^. Thus, for all cells A G M we have N{x) > enA{n), 
so that the probability of error can be changed in at most 2 ^CCln) cells; but 

the probability of each cell is not greater than Hence E{An{F, P)) < 

2^ + 3£ + £A I 
Abstract. Rademacher and Gaussian complexities are successfully used
in learning theory for measuring the capacity of the class of functions to
be learned. One of the most important properties for these complexities
is their Lipschitz property: a composition of a class of functions with a
fixed Lipschitz function may increase its complexity by at most twice the
Lipschitz constant. The proof of this property is non-trivial (in contrast
to the other properties) and it is believed that the proof in the Gaussian
case is conceptually more difficult than the one for the Rademacher case.
In this paper we give a detailed proof of the Lipschitz property for the
general case (with the only assumption which makes the complexity notion
meaningful) including the Rademacher and Gaussian cases.

We also discuss a related topic about the Rademacher complexity of
a class consisting of all the Lipschitz functions with a given Lipschitz
constant. We show that the complexity is surprisingly low in the
one-dimensional case.



1 Introduction 

An important problem in learning theory is to choose a function from a given
class of functions (pattern functions) which best imitates (fits) the underlying
distribution (for example, has the smallest error for a classification problem).
Usually we don't know the underlying distribution and we can only assess it via
a finite sample generated by this distribution. For this strategy to succeed we
usually require that the difference between the sample and true performance is
small for every function in the class (if the sample size is sufficiently large). This
property is referred to as uniform convergence over the class of functions.

If a set is so rich that it always contains a function that fits any given
random dataset, then it is unlikely that the chosen function will fit a new dataset
even if drawn from the same distribution. The ability of a function class to fit
different data is known as its capacity. Clearly the higher the capacity of the
class the greater the risk of overfitting the particular training data and
identifying a spurious pattern. The critical question is how one should measure the
capacity of a function class. Two measures, successfully used in learning theory,
are the Rademacher and Gaussian complexities. The definition rests on the
intuition that we can evaluate the capacity of a class of functions by its ability to fit
random data. We first give the definition of Rademacher complexity, which can
then be readily generalized to an arbitrary complexity.
Definition 1. (Rademacher complexity) Let $X$ be an input space, $D$ be a
distribution on $X$, and $F$ be a real-valued function class defined on $X$. Let
$S = \{x_1, \ldots, x_\ell\}$ be a random sample generated (independently) by $D$. The
empirical Rademacher complexity of $F$ for the given sample $S$ is the following
random variable:

$$\hat{R}_\ell(F) = E_r\Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} r_i f(x_i) \Big| \Big],$$

where $r = \{r_1, \ldots, r_\ell\}$ are iid $\{\pm 1\}$-valued random variables with equal
probabilities for $+1$ and $-1$ and the expectation is taken with respect to $r$.

The Rademacher complexity of $F$ is

$$R_\ell(F) = E_S\big[\hat{R}_\ell(F)\big] = E_{Sr}\Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} r_i f(x_i) \Big| \Big].$$


Definition 2. (Gaussian complexity) We get the definition of the Gaussian
complexity if in Definition 1 we substitute the Rademacher $\pm 1$-valued random
variables $r_1, \ldots, r_\ell$ by the independent Gaussian $N(0, 1)$ random variables
$g_1, \ldots, g_\ell$. The empirical Gaussian complexity and the Gaussian complexity are
usually denoted by $\hat{G}_\ell(F)$ and $G_\ell(F)$ respectively.
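For a finite function class and a small sample, the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below is illustrative only: the function name is hypothetical, the class is represented by its table of values on the sample, and the supremum is taken over the absolute value of the weighted sum, which is an assumption about the exact form of the definition. Replacing the sign vectors by standard normal draws (and averaging over many draws) would give a Monte Carlo estimate of the Gaussian analogue.

```python
import itertools


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite class on a fixed sample.

    function_values -- list with one tuple per function f in the class,
                       holding (f(x_1), ..., f(x_l)) on the sample
    Assumption: the supremum is over |(2 / l) * sum_i r_i * f(x_i)|.
    """
    l = len(function_values[0])
    total = 0.0
    # Average the supremum over all 2**l equally likely sign vectors.
    for signs in itertools.product((-1, 1), repeat=l):
        total += max(
            abs(2.0 / l * sum(r * v for r, v in zip(signs, values)))
            for values in function_values
        )
    return total / 2 ** l


single = empirical_rademacher([(1.0, 1.0)])
pair = empirical_rademacher([(1.0, 1.0), (1.0, -1.0)])
```

The enumeration costs $2^\ell$ evaluations of the supremum, so this is only practical for very small samples.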

The relation between these complexities (Rademacher and Gaussian) is dis- 
cussed in [2]. 

Definition 3. In the same way as in Definition 1 we can define complexity for
an arbitrary distribution $\mu$ on the real line (instead of the Rademacher
distribution concentrated at the points $\pm 1$). We just need to assume that the expectation
integral is bounded for any choice of $f \in F$ and any choice of points $x_1, \ldots, x_\ell$.
We can use the following general notations $\hat{C}_\ell(F)$ and $C_\ell(F)$ for the empirical
$\mu$-complexity and $\mu$-complexity, respectively.

Now we formulate a result which shows the importance of these notions. It
bounds the error of pattern functions in terms of their empirical fit and the
Rademacher complexity of the class. We formulate the result in the form more
usual for applications: instead of the input space $X$ we consider the sample space
$Z := X \times Y$ (we can assume that $Y = \{1, -1\}$, as is the case for binary
classification), and instead of the functions $F = \{f\}$ on $X$ we consider (loss)
functions $H = \{h\}$ defined on $Z$. A function $h(x, y)$ can be defined, for example,
as some 'soft' thresholding at zero of a function $y f(x)$.

Theorem 1. Fix $\delta \in (0, 1)$ and let $H$ be a class of functions mapping from $Z$ to
$[0, 1]$. Let $z_1, \ldots, z_\ell$ be drawn independently according to a probability distribution
$D$. Then with probability at least $1 - \delta$ over random draws of samples of size $\ell$,
every $h \in H$ satisfies:

$$\begin{aligned}
E_D[h(z)] &\le \hat{E}[h(z)] + R_\ell(H) + \sqrt{\frac{\ln(2/\delta)}{2\ell}} \\
&\le \hat{E}[h(z)] + \hat{R}_\ell(H) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}},
\end{aligned} \tag{1}$$

where $E_D[h(z)]$ is the true expectation of $h(z)$ and $\hat{E}[h(z)]$ is the corresponding
empirical one.

The idea of this result is that if we manage to find a function $h \in H$
with a small empirical expectation (the empirical loss is small) then the theorem
guarantees (with high probability) that the same $h$ will provide a small value
for the true loss (under the assumption that the Rademacher complexity of the
pattern functions is small).
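To get a feeling for the magnitudes involved, the data-dependent form of such a bound can be evaluated numerically. The sketch below assumes the deviation term has the commonly quoted form $3\sqrt{\ln(2/\delta)/(2\ell)}$ added to the empirical loss and the empirical Rademacher complexity; the function name and the example values are hypothetical.

```python
import math


def rademacher_bound(empirical_loss, empirical_complexity, sample_size, delta):
    """Upper bound on the true expected loss, holding with probability >= 1 - delta.

    Assumption: the confidence term is 3 * sqrt(ln(2 / delta) / (2 * sample_size)).
    """
    confidence_term = 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * sample_size))
    return empirical_loss + empirical_complexity + confidence_term


# Illustrative values: same empirical quantities, two different sample sizes.
small_sample = rademacher_bound(0.1, 0.2, 100, 0.05)
large_sample = rademacher_bound(0.1, 0.2, 10000, 0.05)
```

The confidence term shrinks like $1/\sqrt{\ell}$, so for large samples the bound is dominated by the empirical loss and the complexity term.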

In complexity estimations (Rademacher or Gaussian) one uses different
properties of the complexity function. In Section 2 we formulate the most important
properties. Their proofs are relatively straightforward from the definition of
complexity except for one, the Lipschitz property: a composition of a class of
functions with a fixed Lipschitz function may increase its complexity by at most twice
the Lipschitz constant. The proof of this property presented in the literature is
quite non-trivial and uses different approaches for the Rademacher and Gaussian
cases (see [5], Th.4.12 and Th.3.17). It is also believed (see [5], Th.4.12) that
the proof in the Rademacher case is conceptually simpler. In Section 4 we
give a detailed proof of the Lipschitz property for the general case. We formulate
this general result in Section 2, where we also discuss the importance of the
Lipschitz property in complexity estimations for different function classes.

In Section 3 we discuss the following related question: Let $H = \{h\}$ be a set of
Lipschitz functions defined on some space $X$. We assume that the corresponding
Lipschitz constants are uniformly bounded by some quantity. We can assume
that this quantity is 1 (this means that the functions $\{h\}$ are contractions);
otherwise we could divide each function $h \in H$ by that quantity. Such function
classes often arise in applications (see, for example, [3]). We ask the following
question: what is the Rademacher complexity of the class $H$ consisting of all
contractions defined on the input space $X$? (To avoid unbounded function classes
giving infinite Rademacher complexity, we need to normalize $H$ in some way;
one possible normalization in case $X = [0, 1]$ is to assume that $h(0) = 0$ for all
$h \in H$. This makes $H$ uniformly bounded on $[0, 1]$.) It turns out that in the
one-dimensional case ($X = [0, 1]$) this complexity is surprisingly small and is at
most twice as large as the Rademacher complexity of a single function $h(x) = x$.
(See Theorem 7 for the details.)

This problem statement is in a sense opposite to that of Section 2. In that
section we take a class of functions $F$ and compose each function $f \in F$
with a fixed Lipschitz function $\phi$ (we can assume that $\phi$ is a
contraction). Then the Rademacher complexity of the function class
$\phi \circ F$ is at most twice the Rademacher complexity of the class $F$. In
contrast, in Section 3 we take the class $H$ of all contractions, say on
$[0, 1]$, which can be considered as the compositions $h \circ I$ of the
contractions $h \in H$ with the single function $I(x) = x$ (the identity
mapping). It turns out that even in this case the Rademacher complexity of the
composition class is at most twice the Rademacher complexity of the original
function $I(x)$. Note that the function $I(x)$ is an element of the class $H$,
so the above result says that the Rademacher complexity of the whole class $H$
is at most twice the Rademacher complexity of one of its elements.

2 Lipschitz Property for Rademacher and General Complexities

We start this section with an example of using the Lipschitz property to
estimate the Rademacher complexity of the class $H$ in Theorem 1.

Often in practice $h(x, y)$ ($h \in H$) is defined as some 'soft' thresholding
at zero of a function $y f(x)$ ($f \in F$), say $h(x, y) = A(y f(x))$, where the
function $A$ is a 'smooth' version of the Heaviside function: one takes $A(t)$
to be a continuous function on $\mathbb{R}$ such that $A(t) = 1$ if $t > 0$,
$A(t) = 0$ if $t < -\gamma$ (for some $\gamma > 0$), and $A(t)$ is linear on the
interval $[-\gamma, 0]$. Evidently the function $A$ is Lipschitz (with Lipschitz
constant $1/\gamma$), and here one uses the property that composing a class of
functions with a fixed Lipschitz function increases the Rademacher complexity
of the class by a factor of at most twice the Lipschitz constant. The
Rademacher complexity of the class of functions $\{y f(x), f \in F\}$ (defined
on $Z = X \times Y$) is the same as the Rademacher complexity of the class
$\{f, f \in F\}$ (defined on $X$), as is easy to see from the definition. It
remains to estimate the Rademacher complexity of the class $\{f, f \in F\}$. In
a particular but important case (for example, when working with kernel
methods), when the functions $f$ are linear functions defined on the unit ball,
the Rademacher complexity can be bounded by $2/\sqrt{l}$ (for more details see
[4], Chapter 4).
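
As a rough numerical illustration of the last bound, the following Python sketch estimates the empirical Rademacher complexity of the linear functions $x \mapsto \langle w, x \rangle$ with $\|w\| \le 1$ on points of the unit ball. It assumes the normalization $\hat{R}_l(F) = E_\sigma \sup_{f \in F} |\frac{2}{l} \sum_i \sigma_i f(x_i)|$, the convention under which the bound $2/\sqrt{l}$ is usually stated; the data, the helper names, and the number of Monte Carlo trials are illustrative assumptions, not taken from the paper.

```python
import math
import random

def linear_class_complexity(points, trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{|w|<=1} |(2/l) sum_i sigma_i <w, x_i>|.

    For the unit ball of linear functions the supremum has the closed form
    (2/l) * ||sum_i sigma_i x_i||, so no search over w is needed.
    """
    rng = random.Random(seed)
    l, dim = len(points), len(points[0])
    total = 0.0
    for _ in range(trials):
        s = [0.0] * dim
        for x in points:
            sign = rng.choice((-1.0, 1.0))
            for k in range(dim):
                s[k] += sign * x[k]
        total += 2.0 / l * math.sqrt(sum(c * c for c in s))
    return total / trials

def sample_unit_ball(l, dim=3, seed=1):
    """Hypothetical data: l points rescaled to have norm at most 1."""
    rng = random.Random(seed)
    pts = []
    for _ in range(l):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in x))
        pts.append([c / max(norm, 1.0) for c in x])
    return pts

points = sample_unit_ball(50)
estimate = linear_class_complexity(points)
bound = 2.0 / math.sqrt(len(points))
```

Under these assumptions the estimate should stay below `bound`, since by Jensen's inequality $E\|\sum_i \sigma_i x_i\| \le \sqrt{\sum_i \|x_i\|^2} \le \sqrt{l}$.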

Here we have described one application of the Lipschitz property to bounding
the Rademacher complexity. For many other interesting applications of this
property we refer the reader to [2].

Now we formulate some useful properties of the Rademacher complexity. (Many of
these properties hold for the Gaussian complexity $G_l$ as well; the others
hold for $G_l$ with an additional factor $\ln l$. See [2] for the details.)

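Theorem 2. Let $F, F_1, \dots, F_n$ and $G$ be classes of real functions. Then:

(1) If $F \subseteq G$, then $\hat{R}_l(F) \le \hat{R}_l(G)$.

(2) $\hat{R}_l(F) = \hat{R}_l(\mathrm{conv}\, F)$.

(3) $\hat{R}_l(cF) = |c| \hat{R}_l(F)$ for every $c \in \mathbb{R}$.

(4) If $A : \mathbb{R} \to \mathbb{R}$ is Lipschitz with constant $L$ and
satisfies $A(0) = 0$, then $\hat{R}_l(A \circ F) \le 2L \hat{R}_l(F)$.

(5) $\hat{R}_l(F + h) \le \hat{R}_l(F) + 2\sqrt{\hat{E}[h^2]/l}$ for any function $h$.

(6) For any $1 \le q < \infty$, let $L_{F,h,q} = \{|f - h|^q, f \in F\}$. If
$\|f - h\|_\infty \le 1$ for every $f \in F$, then
$\hat{R}_l(L_{F,h,q}) \le 2q \left(\hat{R}_l(F) + 2\sqrt{\hat{E}[h^2]/l}\right)$.

(7) $\hat{R}_l\left(\sum_{i=1}^n F_i\right) \le \sum_{i=1}^n \hat{R}_l(F_i)$.

Here $\mathrm{conv}(F)$ means the convex hull of $F$ and
$A \circ F = \{A \circ f, f \in F\}$, where $A \circ f$ denotes the composition
of $A$ and $f$.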

The proofs for these results, with the exception of (4), are all relatively 
straightforward applications of the definition of empirical Rademacher complex- 
ity. In Section 4 we prove the property (4) in the following general setting: 

Theorem 3. Let $\mu$ be an arbitrary distribution on $\mathbb{R}$ with zero
mean. Let $\hat{C}_l$ denote the corresponding empirical complexity for this
distribution (as defined in Definition 3). Let
$A : \mathbb{R} \to \mathbb{R}$ be a Lipschitz function with constant $L$ that
satisfies $A(0) = 0$. Then for any real-valued function class $F$ we have:

$$\hat{C}_l(A \circ F) \le 2L \hat{C}_l(F).$$

If the function $A$ is an odd function ($A(-t) = -A(t)$), then we can drop the
factor 2 in the last theorem:

Theorem 4. Let $\mu$ be an arbitrary distribution on $\mathbb{R}$ with zero
mean. Let $\hat{C}_l$ denote the corresponding empirical complexity for this
distribution (as defined in Definition 3). Let
$A : \mathbb{R} \to \mathbb{R}$ be an odd ($A(-t) = -A(t)$) Lipschitz function
with constant $L$ that satisfies $A(0) = 0$. Then for any real-valued function
class $F$ we have:

$$\hat{C}_l(A \circ F) \le L \hat{C}_l(F).$$

Note that the only condition on $\mu$ in the above theorems, that it has zero
mean, is absolutely necessary to make the notion of complexity meaningful. If
the mean of $\mu$ is different from zero, then even a single function that is
identically constant has a large complexity, which does not go to zero as the
sample size goes to infinity; this is unreasonable.

(In the above theorems we assume that the expectation integral of Definition 3
is bounded for any choice of $f \in F$ and any choice of points
$x_1, \dots, x_l$.)

The assertions of the last two theorems trivially hold for the non-empirical
complexity ($C_l$) as well.
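
To make Theorems 3 and 4 concrete in the Rademacher case, the following Python sketch computes the empirical complexity of a small finite class exactly, by enumerating all sign vectors, and compares it with the complexity of the composed class. The normalization $\frac{2}{l}$, the randomly generated class `F`, and the two contractions (`hinge`, which is not odd, and `math.tanh`, which is odd) are assumptions chosen for illustration only.

```python
import itertools
import math
import random

def empirical_complexity(values):
    """Exact E_r sup_f |(2/l) sum_i r_i f_i| for Rademacher signs r_i.

    `values` lists, for each function f in a finite class, the vector
    (f(x_1), ..., f(x_l)); all 2^l sign patterns are enumerated.
    """
    l = len(values[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=l):
        total += max(abs(sum(r * v for r, v in zip(signs, f))) for f in values)
    return 2.0 / l * total / 2 ** l

def compose(A, values):
    """Apply A to every function value, i.e. form the class A o F."""
    return [[A(v) for v in f] for f in values]

def hinge(t):
    """A contraction with hinge(0) = 0 that is not odd."""
    return max(t, 0.0)

rng = random.Random(0)
# Hypothetical class: 5 functions evaluated at l = 8 sample points.
F = [[rng.uniform(-2.0, 2.0) for _ in range(8)] for _ in range(5)]

c_F = empirical_complexity(F)
c_hinge = empirical_complexity(compose(hinge, F))      # Theorem 3, L = 1
c_tanh = empirical_complexity(compose(math.tanh, F))   # Theorem 4, L = 1
```

With $L = 1$, Theorem 3 predicts `c_hinge <= 2 * c_F` and Theorem 4 predicts `c_tanh <= c_F`; the check is exact up to floating-point rounding because the expectation is a finite sum.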

Question: Is the factor 2 optimal in Theorem 3? 

3 Rademacher Complexity of Lipschitz Functions 

Let $X = [0, 1] \subset \mathbb{R}$. Our aim is to estimate the Rademacher
complexity of the set of all Lipschitz functions on $[0, 1]$ with Lipschitz
constant at most $L$. This set is of course not uniformly bounded (it contains
all the constant functions, for example), which makes its Rademacher complexity
infinite. To make these functions uniformly bounded we require that each
function vanishes at some point of $[0, 1]$. (This makes the function class
uniformly bounded; one could demand this property instead.) It turns out that
the Rademacher complexity of this class is very small and is comparable with
the Rademacher complexity of a single function. We formulate Theorems 5 and 6
for the empirical Rademacher complexity $\hat{R}_l(H)$, and we estimate the
non-empirical complexity $R_l(H)$ in Theorem 7.

Theorem 5. Let $H$ be the class of Lipschitz functions with Lipschitz constants
at most $L$ on the interval $\Delta = [0, 1]$ and vanishing at some point of
this interval. Then for any set of points $\{x_1, \dots, x_l\} \subset \Delta$
we have

$$\hat{R}_l(H) \le 2L \hat{R}_l(1_\Delta),$$

where $1_\Delta$ is the function identically equal to 1 on $\Delta$.

If we consider the class of functions vanishing at the origin, we gain a factor of 2:

Theorem 6. Let $H$ be the class of Lipschitz functions with Lipschitz constants
at most $L$ on the interval $\Delta = [0, 1]$ and vanishing at the point 0. Then
for any set of points $\{x_1, \dots, x_l\} \subset \Delta$ we have

$$\hat{R}_l(H) \le L \hat{R}_l(1_\Delta),$$

where $1_\Delta$ is the function identically equal to 1 on $\Delta$.

In the above theorems we have compared the Rademacher complexity of the whole
class $H$ with that of the single function $1_\Delta$. In the next theorem we
make a comparison with the identity mapping $I(x) = x$:

Theorem 7. Let $H$ be the class of Lipschitz functions with Lipschitz constants
at most $L$ on the interval $\Delta = [0, 1]$ and vanishing at the point 0. Then
for any symmetric distribution $D$ on $\Delta = [0, 1]$ (symmetric with respect
to the middle point $1/2$) we have

$$R_l(H) \le 2L R_l(I),$$

where $I$ is the identity mapping $I(x) = x$.

Note that in Theorem 7 the function $I(x)$ is an element of the class $H$, and
the theorem says that the Rademacher complexity of the whole class $H$ is at
most twice the Rademacher complexity of one of its elements. This result can
also be viewed in another way: composing all the functions $h \in H$ with the
single function $I(x) = x$ (which does not change $h$) may increase the
Rademacher complexity (compared with the Rademacher complexity of the single
function $I(x)$) by a factor of at most twice the Lipschitz constant $L$. It is
interesting to compare this result with Theorem 3.

To estimate the Rademacher complexity on the right-hand sides of the last
three theorems we can use property (5) of Theorem 2, where we take $F$
containing only the identically zero function. This gives $\hat{R}_l(F) = 0$
and consequently $\hat{R}_l(h) \le 2/\sqrt{l}$, as $|h| \le 1$ on $[0, 1]$.

The authors don’t know the answers to the following questions: 

Question. Is it possible to drop the factor 2 in Theorem 5? 

4 Proofs 

Proof of Theorem 3. Our proof is inspired by the proof of Theorem 4.12 from
[5], but we have tried to simplify it and have added some new components that
make it work for an arbitrary distribution $\mu$. Without loss of generality we
can assume $L = 1$ for the Lipschitz constant. This means that the function $A$
is a contraction: $|A(t) - A(s)| \le |t - s|$. Fix $\{x_1, \dots, x_l\}$. For
simplicity of notation we denote $f(x_i) = f_i$. So now if
$r = (r_1, \dots, r_l)$ denotes the i.i.d. random variables generated by $\mu$,
then we have to prove that

$$E_r\left[\sup_{f \in F} \left|\sum_{i=1}^{l} r_i A(f_i)\right|\right]
\le 2 E_r\left[\sup_{f \in F} \left|\sum_{i=1}^{l} r_i f_i\right|\right]. \tag{2}$$

Denote the left-hand side of the last inequality by $L$ (not to be confused
with the Lipschitz constant, which, by our assumption, is now 1) and the
right-hand side by $R$. We can assume that the function class $F$ is closed
with respect to negation ($f \in F \Rightarrow -f \in F$); otherwise adding the
set $\{-f\}$ to $F = \{f\}$ can only increase $L$ and leaves $R$ unchanged. We
also assume that the identically zero function belongs to $F$. This changes
neither $R$ nor $L$ (since $A(0) = 0$).

Denote $A^+(t) = A(t)$, $A^-(t) = -A(-t)$ and $A^\pm = \{A^+, A^-\}$. We
introduce the following quantity:

$$M := E_r\left[\sup_{f \in F,\, A \in A^\pm} \left|\sum_{i=1}^{l} r_i A(f_i)\right|\right].$$

It suffices to show that $L \le M$ and $M \le R$. The first inequality is
evident. We need to prove that $M \le R$. Since $F$ is closed with respect to
negation, we can drop the absolute value sign in the expression for $R$. We can
do the same in the expression for $M$ (introducing $A^-$ served this purpose).
So what we need to prove now is:

$$E_r\left[\sup_{f \in F,\, A \in A^\pm} \sum_{i=1}^{l} r_i A(f_i)\right]
\le 2 E_r\left[\sup_{f \in F} \sum_{i=1}^{l} r_i f_i\right]. \tag{3}$$

Evidently, for each fixed $r = (r_1, \dots, r_l)$ we have

$$\sup_{f \in F,\, A \in A^\pm} \sum_{i=1}^{l} r_i A(f_i)
\le \sup_{f \in F} \sum_{i=1}^{l} r_i A^+(f_i) + \sup_{f \in F} \sum_{i=1}^{l} r_i A^-(f_i).$$

(Here we use the fact that the identically zero function belongs to $F$, which
makes both terms on the right-hand side of the last inequality non-negative.)

It is this last inequality that the factor 2 in Theorem 3 comes from.

Now in order to prove (3) it suffices to prove that

$$E_r\left[\sup_{f \in F} \sum_{i=1}^{l} r_i A(f_i)\right]
\le E_r\left[\sup_{f \in F} \sum_{i=1}^{l} r_i f_i\right] \tag{4}$$

for an arbitrary contraction $A$ with $A(0) = 0$.

The main idea now is the following: instead of proving (4) directly, we
introduce an intermediate expression
$E_r\left[\sup_{f \in F} \sum_{i=1}^{l} r_i A_i(f_i)\right]$, where each $A_i$
is either $A$ or the identity mapping $I(t) = t$. If all the $A_i$ are $A$, we
get the left-hand side of (4), and if all the $A_i$ are $I$, we get the
right-hand side. Now, reducing the number of $A_i$'s equal to $A$ one by one
(induction principle), we show that at each step we do not decrease the value
of $E_r\left[\sup_{f \in F} \sum_{i=1}^{l} r_i A_i(f_i)\right]$. It is enough to
show the first step: prove that

$$E_r\left[\sup_{f \in F}\bigl(r_1 A_1(f_1) + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr)\right]
\le E_r\left[\sup_{f \in F}\bigl(r_1 f_1 + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr)\right]. \tag{5}$$

A first naive attempt fails: for a fixed $r$, the pointwise inequality

$$\sup_{f \in F}\bigl(r_1 A_1(f_1) + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr)
\le \sup_{f \in F}\bigl(r_1 f_1 + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr)$$

does not hold in general.

Next attempt: group $(r_1, r_2, \dots, r_l)$, $r_1 > 0$, with
$(-r_1, r_2, \dots, r_l)$. Here, to start with, we make the extra assumption
that the measure $\mu$ is symmetric. (For technical simplicity one can even
imagine that $\mu$ is a discrete measure.) Later we comment on the case when
$\mu$ is not necessarily symmetric. So we have to prove that

$$\begin{aligned}
&\sup_{f \in F}\bigl(r_1 A_1(f_1) + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr)
 + \sup_{f \in F}\bigl(-r_1 A_1(f_1) + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr) \\
&\quad \le \sup_{f \in F}\bigl(r_1 f_1 + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr)
 + \sup_{f \in F}\bigl(-r_1 f_1 + r_2 A_2(f_2) + \dots + r_l A_l(f_l)\bigr).
\end{aligned} \tag{6}$$

To prove the last inequality it suffices to show that for each pair of
functions $\{f^+, f^-\} \subset F$ there is another pair of functions
$\{g^+, g^-\} \subset F$ such that

$$\begin{aligned}
&\bigl(r_1 A(f_1^+) + r_2 A_2(f_2^+) + \dots + r_l A_l(f_l^+)\bigr)
 + \bigl(-r_1 A(f_1^-) + r_2 A_2(f_2^-) + \dots + r_l A_l(f_l^-)\bigr) \\
&\quad \le \bigl(r_1 g_1^+ + r_2 A_2(g_2^+) + \dots + r_l A_l(g_l^+)\bigr)
 + \bigl(-r_1 g_1^- + r_2 A_2(g_2^-) + \dots + r_l A_l(g_l^-)\bigr).
\end{aligned}$$

The choice $g^+ = f^+$, $g^- = f^-$ reduces this to

$$A(f_1^+) - A(f_1^-) \le f_1^+ - f_1^-.$$

The choice $g^+ = f^-$, $g^- = f^+$ reduces it to

$$A(f_1^+) - A(f_1^-) \le f_1^- - f_1^+.$$

Due to the contraction property of $A$, at least one of the last two
inequalities is true, namely, the one for which the right-hand side is
non-negative. This proves (6).

Now integrating both sides of (6) over the domain
$[0, \infty) \times (-\infty, \infty)^{l-1}$ with respect to the measure
$d\mu(r_1) \times \dots \times d\mu(r_l)$ we get (5).

The proof above works in the case of symmetric $\mu$. If $\mu$ is not
symmetric, we proceed as follows (a sketch of the proof). We approximate $\mu$
by a smooth measure $\mu_1$ which is, in addition, positive on the minimal
interval containing the support of $\mu$ and which has zero mean. For a small
$\varepsilon$, divide the real line by points $c_i$,
$i = 0, \pm 1, \pm 2, \dots, \pm N$, so that $c_0 = 0$ and
$\int_{c_{i-1}}^{c_i} r \, d\mu_1(r) = \varepsilon$ for $i = 1, \dots, N$. Now
from each interval $[c_{i-1}, c_i]$, $i = 1, 2, \dots, N$, choose a point $R_i$
so that $R_i \int_{c_{i-1}}^{c_i} d\mu_1(r) = \varepsilon$. In the same way we
choose points $R_i$ from $[c_i, c_{i+1}]$, $i = -1, -2, \dots, -N$. Consider
the discrete measure $\mu_2$ which puts at each $R_i$ the $\mu_1$-mass of the
corresponding interval, and use this measure as an approximation of the initial
measure $\mu$. Now group together $R_i$ and $R_{-i}$. These two points have
different probabilities but they have equal expectations. This allows us to use
arguments similar to those used in the symmetric case.
Theorem 3 is proved.

Proof of Theorem 4. We need to prove that

$$E_r\left[\sup_{f \in F} \left|\sum_{i=1}^{l} r_i A(f_i)\right|\right]
\le E_r\left[\sup_{f \in F} \left|\sum_{i=1}^{l} r_i f_i\right|\right]. \tag{7}$$

We assume again that $F$ is closed with respect to negation (this does not
change (7)). The expressions inside the absolute value signs on both sides of
(7) are odd functions of $f$. This means that we can drop the absolute value
signs. The rest follows from (4).

Theorem 4 is proved.

Proof of Theorem 6. Without loss of generality we can assume that $L = 1$. Fix
$\{x_1, \dots, x_l\} \subset \Delta = [0, 1]$,
$0 \le x_1 \le x_2 \le \dots \le x_l \le 1$. Fix $r = (r_1, \dots, r_l)$,
$r_i = \pm 1$, $i = 1, \dots, l$. It can be shown that the class $H$ in the
theorem is compact in the uniform ($L^\infty$) metric. This means that the
supremum in the definition of the Rademacher complexity is achieved for some
function $h(x)$ (depending on $(r_1, \dots, r_l)$ and $\{x_1, \dots, x_l\}$).
Then $-h(x)$ also provides the same supremum. So we can assume that

$$\sup_{f \in H} \left|\sum_{i=1}^{l} r_i f(x_i)\right| = \sum_{i=1}^{l} r_i h(x_i).$$

In particular, we have that

$$\sum_{i=1}^{l} r_i h(x_i) \ge \sum_{i=1}^{l} r_i f(x_i), \quad \forall f \in H. \tag{8}$$

Denote $d_1 = x_1 - 0$, $d_2 = x_2 - x_1$, $\dots$, $d_l = x_l - x_{l-1}$. We have

$$d_i \ge 0, \quad \sum_{i=1}^{l} d_i \le 1. \tag{9}$$

Due to the Lipschitz condition (with $L = 1$) we have $|h(x_1)| \le d_1$.
Consider the quantity $\mathrm{sgn}(r_1 + \dots + r_l)$, where
$\mathrm{sgn}(x) = 1$ if $x > 0$, $\mathrm{sgn}(x) = -1$ if $x < 0$ and
$\mathrm{sgn}(0) = 0$. If $\mathrm{sgn}(r_1 + \dots + r_l) > 0$, then we must
have $h(x_1) = d_1$ in order to guarantee (8) (otherwise we could lift the
function $h(x)$ on the interval $[x_1, 1]$; the new function would still be
Lipschitz with constant 1, but this lift would increase the left-hand side of
(8)). If $\mathrm{sgn}(r_1 + \dots + r_l) < 0$, then we must have
$h(x_1) = -d_1$. If $\mathrm{sgn}(r_1 + \dots + r_l) = 0$, then lifting $h(x)$
up (or down) on the interval $[x_1, 1]$ does not affect the left-hand side of
(8). So we can assume that

$$h(x_1) = d_1 \,\mathrm{sgn}(r_1 + \dots + r_l).$$

Now, having fixed $h(x_1)$, we can show in the same way that

$$h(x_2) = h(x_1) + d_2 \,\mathrm{sgn}(r_2 + \dots + r_l)
= d_1 \,\mathrm{sgn}(r_1 + \dots + r_l) + d_2 \,\mathrm{sgn}(r_2 + \dots + r_l).$$

In general, for $i = 1, \dots, l$ we have

$$h(x_i) = d_1 \,\mathrm{sgn}(r_1 + \dots + r_l) + \dots + d_i \,\mathrm{sgn}(r_i + \dots + r_l).$$
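
The extremal function described above can be checked numerically. The following Python sketch (with hypothetical helper names and randomly chosen points, all of which are illustrative assumptions) compares the value attained by the formula for $h(x_i)$ with a brute-force maximum. The brute force works with the increments $u_j = h(x_j) - h(x_{j-1})$, which for a 1-Lipschitz $h$ with $h(0) = 0$ range over $[-d_j, d_j]$; since the objective is linear in the $u_j$, it suffices to try the corners $u_j = \pm d_j$.

```python
import itertools
import random

def sgn(v):
    """Sign function with sgn(0) = 0, as in the proof."""
    return (v > 0) - (v < 0)

def extremal_values(d, r):
    """Values h(x_1), ..., h(x_l) with h(x_i) = sum_{j<=i} d_j * sgn(r_j + ... + r_l)."""
    h, acc = [], 0.0
    for j in range(len(d)):
        acc += d[j] * sgn(sum(r[j:]))
        h.append(acc)
    return h

def brute_force_sup(d, r):
    """Maximum of sum_i r_i h(x_i) over all corner increments u_j in {-d_j, +d_j}."""
    best = float("-inf")
    for corner in itertools.product((-1.0, 1.0), repeat=len(d)):
        acc, value = 0.0, 0.0
        for dj, cj, ri in zip(d, corner, r):
            acc += cj * dj        # h(x_j) built from its increments
            value += ri * acc
        best = max(best, value)
    return best

rng = random.Random(0)
xs = sorted(rng.random() for _ in range(6))          # hypothetical sample points
d = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]    # gaps d_1, ..., d_l

gaps = []
for r in itertools.product((-1, 1), repeat=len(d)):
    h = extremal_values(d, r)
    attained = sum(ri * hi for ri, hi in zip(r, h))
    gaps.append(abs(attained - brute_force_sup(d, r)))
max_gap = max(gaps)
```

In this sketch `max_gap` should be zero up to rounding, reflecting the identity $\sum_i r_i h(x_i) = \sum_j d_j |r_j + \dots + r_l|$ for the extremal $h$.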

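The last equality gives an expression for the left-hand side of (8) in terms
of $r = (r_1, \dots, r_l)$ only (recall that $d_1, \dots, d_l$ are fixed):

$$\begin{aligned}
\sum_{i=1}^{l} r_i h(x_i) ={}& r_1\bigl[d_1 \,\mathrm{sgn}(r_1 + \dots + r_l)\bigr] \\
&+ r_2\bigl[d_1 \,\mathrm{sgn}(r_1 + \dots + r_l) + d_2 \,\mathrm{sgn}(r_2 + \dots + r_l)\bigr] + \dots \\
&+ r_l\bigl[d_1 \,\mathrm{sgn}(r_1 + \dots + r_l) + d_2 \,\mathrm{sgn}(r_2 + \dots + r_l) + \dots + d_l \,\mathrm{sgn}(r_l)\bigr].
\end{aligned} \tag{10}$$

The expectation of the last expression is exactly the empirical Rademacher
complexity. In order to estimate this expectation we denote
$m_{l-i+1} := E_r[r_i \,\mathrm{sgn}(r_i + \dots + r_l)]$. Evidently it depends
only on the index $l - i + 1$. Then for the Rademacher complexity we get from
(10) that (now we write $h_r$ instead of $h$ to indicate the dependence of $h$
on $r$):

$$\begin{aligned}
E_r\left[\sum_{i=1}^{l} r_i h_r(x_i)\right]
&= d_1 m_l + [d_1 m_l + d_2 m_{l-1}] + \dots + [d_1 m_l + d_2 m_{l-1} + \dots + d_l m_1] \\
&= d_1[l \cdot m_l] + d_2[(l-1) \cdot m_{l-1}] + \dots + d_l[1 \cdot m_1].
\end{aligned} \tag{11}$$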

Now we will show that $m_1, \dots, m_l$ are the central (middle) elements of
Pascal's triangle of binomial coefficients (where each row is to be divided by
2 raised to the index of the row):

            1
           1 1
          1 2 1
         1 3 3 1
        1 4 6 4 1
      1 5 10 10 5 1

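It is enough to calculate $E_r[r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)]$
(since $l$ is an arbitrary positive integer). The expression
$r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)$ is an even function of
$r = (r_1, \dots, r_l)$, so we can assume $r_1 = 1$:

$$\begin{aligned}
E_r[r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)]
={}& E_r[1 \cdot \mathrm{sgn}(1 + r_2 + \dots + r_l)] \\
={}& E_r\bigl[\mathrm{sgn}(1 + r_2 + \dots + r_l) \,\big|\, |r_2 + \dots + r_l| > 1\bigr] \cdot \mathrm{Prob}\{|r_2 + \dots + r_l| > 1\} \\
&+ \mathrm{sgn}(1 + 0) \cdot \mathrm{Prob}\{r_2 + \dots + r_l = 0\} \\
&+ \mathrm{sgn}(1 + 1) \cdot \mathrm{Prob}\{r_2 + \dots + r_l = 1\} \\
&+ \mathrm{sgn}(1 - 1) \cdot \mathrm{Prob}\{r_2 + \dots + r_l = -1\}.
\end{aligned} \tag{12}$$

Note that if $|r_2 + \dots + r_l| > 1$ then
$\mathrm{sgn}(1 + r_2 + \dots + r_l) = \mathrm{sgn}(r_2 + \dots + r_l)$. Now,
taking into account that $\mathrm{sgn}(r_2 + \dots + r_l)$ is an odd function
of $r = (r_1, \dots, r_l)$, we get that the first term on the right-hand side
of (12) is zero. Note also that by our definition
$\mathrm{sgn}(1 - 1) = \mathrm{sgn}(0) = 0$, so the last term on the right-hand
side of (12) is zero as well. Consequently from (12) we get:

$$E_r[r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)]
= 1 \cdot \mathrm{Prob}\{r_2 + \dots + r_l = 0\} + 1 \cdot \mathrm{Prob}\{r_2 + \dots + r_l = 1\}. \tag{13}$$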

To evaluate the last expression we consider two cases:

Case 1: $l$ is even, $l = 2t$. In this case the equality
$r_2 + \dots + r_l = 0$ is impossible, so

$$E_r[r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)] = \mathrm{Prob}\{r_2 + \dots + r_l = 1\}
= \frac{\text{number of different } (t-1)\text{-tuples out of } 2t-1 \text{ points}}{2^{2t-1}}
= \frac{1}{2^{2t-1}} \binom{2t-1}{t-1}. \tag{14}$$

Case 2: $l$ is odd, $l = 2t + 1$. In this case the equality
$r_2 + \dots + r_l = 1$ is impossible, so

$$E_r[r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)] = \mathrm{Prob}\{r_2 + \dots + r_l = 0\}
= \frac{\text{number of different } t\text{-tuples out of } 2t \text{ points}}{2^{2t}}
= \frac{1}{2^{2t}} \binom{2t}{t}. \tag{15}$$
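
Formulas (14) and (15) can be checked for small $l$ by direct enumeration. The Python sketch below (helper names are illustrative assumptions) computes $m_l = E_r[r_1 \cdot \mathrm{sgn}(r_1 + \dots + r_l)]$ exactly with rational arithmetic and compares it with the central element of row $l - 1$ of Pascal's triangle divided by $2^{l-1}$, which covers both the even and the odd case.

```python
import itertools
from fractions import Fraction
from math import comb

def sgn(v):
    """Sign function with sgn(0) = 0."""
    return (v > 0) - (v < 0)

def m_exact(l):
    """m_l = E_r[r_1 * sgn(r_1 + ... + r_l)], enumerating all 2^l sign vectors."""
    total = sum(r[0] * sgn(sum(r)) for r in itertools.product((-1, 1), repeat=l))
    return Fraction(total, 2 ** l)

def m_formula(l):
    """Central element of row l-1 of Pascal's triangle, divided by 2^(l-1)."""
    return Fraction(comb(l - 1, (l - 1) // 2), 2 ** (l - 1))

# Pairs (enumerated value, closed form) for l = 1, ..., 10.
table = {l: (m_exact(l), m_formula(l)) for l in range(1, 11)}
```

For $l = 2t$ the index $(l-1)//2$ equals $t - 1$, matching (14), and for $l = 2t + 1$ it equals $t$, matching (15).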

Now returning to the equality (11), we will prove that

$$l \, m_l \ge (l - 1) m_{l-1} \ge \dots \ge 1 \cdot m_1. \tag{16}$$

It suffices to prove that

$$\frac{(i + 1) m_{i+1}}{i \, m_i} \ge 1. \tag{17}$$

This is easy to do using (14) and (15). If $i = 2t$ ($i$ is even), then using
the formula for binomial coefficients we get

$$\frac{(i + 1) m_{i+1}}{i \, m_i} = \frac{2t + 1}{2t} > 1. \tag{18}$$

In the case of odd $i$, $i = 2t + 1$, we get

$$\frac{(i + 1) m_{i+1}}{i \, m_i} = 1. \tag{19}$$

The last two relations give (17), so (16) is proved. Now (16) together with (9)
shows that the right-hand side of (11) achieves its maximum if we take $d_1$ as
big as possible, namely $d_1 = 1$, which gives
$x_1 = x_2 = \dots = x_l = 1$. The Rademacher complexity in this case will be
maximal if $|h(1)|$ is as big as possible. Due to the Lipschitz condition (with
constant $L = 1$) the maximal value of $|h(1)|$ is 1. We can take $h(1) = 1$
for all $r = (r_1, \dots, r_l)$. Evidently the Rademacher complexity in this
case ($x_1 = x_2 = \dots = x_l = 1$) is the same as the Rademacher complexity
of the function $1_{[0,1]}$ identically equal to one (for an arbitrary choice
of $\{x_1, \dots, x_l\} \subset [0, 1]$).

Theorem 6 is proved. 
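
As a sanity check of Theorem 6 with $L = 1$, the following Python sketch evaluates both sides exactly for a few randomly chosen point sets. It uses the closed form from the proof, $\sup_h \sum_i r_i h(x_i) = \sum_j d_j |r_j + \dots + r_l|$, and omits the common normalization factor on both sides; the sample size, the random points and the function names are illustrative assumptions.

```python
import itertools
import random

def complexity_lipschitz(xs):
    """Unnormalised E_r sup_h sum_i r_i h(x_i) over 1-Lipschitz h with h(0) = 0.

    Uses the closed form from the proof: the supremum equals
    sum_j d_j * |r_j + ... + r_l|, with d_j the gaps between sorted points.
    """
    xs = sorted(xs)
    d = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]
    l = len(xs)
    total = 0.0
    for r in itertools.product((-1, 1), repeat=l):
        total += sum(d[j] * abs(sum(r[j:])) for j in range(l))
    return total / 2 ** l

def complexity_constant_one(l):
    """Unnormalised E_r |r_1 + ... + r_l| for the single function equal to 1."""
    return sum(abs(sum(r)) for r in itertools.product((-1, 1), repeat=l)) / 2 ** l

rng = random.Random(0)
l = 8
rhs = complexity_constant_one(l)
worst = max(complexity_lipschitz([rng.random() for _ in range(l)]) for _ in range(20))
at_one = complexity_lipschitz([1.0] * l)   # extremal configuration x_1 = ... = x_l = 1
```

In this sketch `worst` should not exceed `rhs`, and `at_one` should coincide with `rhs`, in line with the extremal configuration identified at the end of the proof.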

Proof of Theorem 5. Evidently, Theorem 6 stays true if, instead of demanding
that the functions vanish at $x = 0$, we demand that they vanish at $x = 1$.
Now, any function $h(x)$ which vanishes at some point $x_0 \in [0, 1]$ can be
written as $h(x) = h_1(x) + h_2(x)$, where $h_1(x)$ coincides with $h(x)$ on
$[0, x_0]$ and is identically zero on $[x_0, 1]$, and $h_2(x)$ coincides with
$h(x)$ on $[x_0, 1]$ and is identically zero on $[0, x_0]$. Evidently $h_1(x)$
vanishes at $x = 1$ and $h_2(x)$ vanishes at $x = 0$. If, in addition, $h(x)$
is a Lipschitz function with constant $L$, then both $h_1(x)$ and $h_2(x)$ are
Lipschitz functions with the same constant. Finally, Theorem 5 follows from
Theorem 6 and Theorem 2 (part (7)).

Proof of Theorem 7. We have $1_{[0,1]}(x) = I(x) + (1 - I(x))$ on $[0, 1]$.
Theorem 2 (part (7)) gives that

$$R_l(1_{[0,1]}) \le R_l(I) + R_l(1 - I).$$

Since the distribution $D$ is symmetric and the two functions $I(x)$ and
$1 - I(x)$ are reflections of each other in the vertical line $x = 1/2$, we get
that $R_l(I) = R_l(1 - I)$. This together with the last inequality and
Theorem 6 proves Theorem 7.

Abstract. Kernel functions are typically viewed as providing an implicit
mapping of points into a high-dimensional space, with the ability to gain much
of the power of that space without incurring a high cost if data is separable
in that space by a large margin $\gamma$. However, the Johnson-Lindenstrauss
lemma suggests that in the presence of a large margin, a kernel function can
also be viewed as a mapping to a low-dimensional space, one of dimension only
$\tilde{O}(1/\gamma^2)$. In this paper, we explore the question of whether one
can efficiently compute such implicit low-dimensional mappings, using only
black-box access to a kernel function. We answer this question in the
affirmative if our method is also allowed black-box access to the underlying
distribution (i.e., unlabeled examples). We also give a lower bound, showing
this is not possible for an arbitrary black-box kernel function, if we do not
have access to the distribution. We leave open the question of whether such
mappings can be found efficiently without access to the distribution for
standard kernel functions such as the polynomial kernel.

Our positive result can be viewed as saying that designing a good kernel
function is much like designing a good feature space. Given a kernel, by
running it in a black-box manner on random unlabeled examples, we can generate
an explicit set of $\tilde{O}(1/\gamma^2)$ features, such that if the data was
linearly separable with margin $\gamma$ under the kernel, then it is
approximately separable in this new feature space.



1 Introduction 

Kernels and margins have been a powerful combination in Machine Learning. A
kernel function implicitly allows one to map data into a high-dimensional space
and perform certain operations there without paying a high price
computationally. Furthermore, if the data indeed has a large margin linear
separator in that space, then one can avoid paying a high price in terms of
sample size as well [6, 7, 9, 11, 13, 12, 14, 15].

The starting point for this paper is the observation that if a learning problem
indeed has the large margin property under some kernel
$K(x, y) = \phi(x) \cdot \phi(y)$, then by the Johnson-Lindenstrauss lemma, a
random linear projection of the "$\phi$-space" down to a low-dimensional space
approximately preserves linear separability [1, 2, 8, 10]. Specifically, if a
target function has margin $\gamma$ in the $\phi$-space, then a random linear
projection of the $\phi$-space down to a space of dimension
$d = O\left(\frac{1}{\gamma^2} \log \frac{1}{\varepsilon\delta}\right)$ will,
with probability at least $1 - \delta$, have a linear separator of error at
most $\varepsilon$ (see, e.g., Arriaga and Vempala [2] and also Theorem 3 of
this paper). This means that for any kernel $K$ and margin $\gamma$, we can, in
principle, think of $K$ as mapping the input space $X$ into an
$\tilde{O}(1/\gamma^2)$-dimensional space, in essence serving as a
representation of the data in a new (and not too large) feature space.

The question we consider in this paper is whether, given kernel $K$, we can in
fact produce such a mapping efficiently. The problem with the above observation
is that it requires explicitly computing the function $\phi(x)$. In particular,
the mapping of $X$ into $\mathbb{R}^d$ is a function $F(x) = A\phi(x)$, where
$A$ is a random matrix. However, for a given kernel $K$, the dimensionality and
description of $\phi(x)$ might be large or even unknown. Instead, what we would
like is an efficient procedure that, given $K(\cdot, \cdot)$ as a black-box
program, produces a mapping with the desired properties but with running time
that depends (polynomially) only on $1/\gamma$ and the time to compute the
kernel function $K$, with no dependence on the dimensionality of the
$\phi$-space.

Our main result is a positive answer to this question, if our procedure for
computing the mapping is also given black-box access to the distribution $D$
(i.e., unlabeled data). Specifically, given black-box access to a kernel
function $K(x, y)$, a margin value $\gamma$, access to unlabeled examples from
distribution $D$, and parameters $\varepsilon$ and $\delta$, we can in
polynomial time construct a mapping of the feature space
$F : X \to \mathbb{R}^d$ where
$d = O\left(\frac{1}{\gamma^2} \log \frac{1}{\varepsilon\delta}\right)$, such
that if the target concept indeed has margin $\gamma$ in the $\phi$-space, then
with probability $1 - \delta$ (over randomization in our choice of mapping
function), the induced distribution in $\mathbb{R}^d$ is separable with error
$\le \varepsilon$.

In particular, if we set $\varepsilon \ll \varepsilon' \gamma^2$, where
$\varepsilon'$ is our input error parameter, then the error rate of the induced
target function in $\mathbb{R}^d$ is sufficiently small that a set $S$ of
$\tilde{O}(d/\varepsilon')$ labeled examples will, with high probability, be
perfectly separable in the mapped space. This means that if the target function
was truly separable with margin $\gamma$ in the $\phi$-space, we can apply an
arbitrary zero-noise linear-separator learning algorithm in the mapped space
(such as a highly-optimized linear-programming package). In fact, with high
probability, not only will the data in $\mathbb{R}^d$ be separable, but it will
be separable with margin $\gamma/2$. However, while the dimension $d$ has a
logarithmic dependence on $1/\varepsilon$, the number of (unlabeled) examples
we use to produce the mapping is $\tilde{O}(1/(\gamma^2 \varepsilon))$.

Given the above results, a natural question is whether it might be possible to
perform mappings of this type without access to the underlying distribution.
In Section 5 we show that this is in general not possible, given only black-box
access (and polynomially many queries) to an arbitrary kernel $K$. However, it
may well be possible for specific standard kernels such as the polynomial
kernel or the Gaussian kernel.

Our goals are to some extent related to those of Ben-David et al. [4, 5]. They
show negative results giving simple learning problems where one cannot
construct mappings to low-dimensional spaces that preserve separability. We
restrict ourselves to situations where we know that such mappings exist, but
our goal is to produce them efficiently.

Outline of results: We begin in Section 3 by giving a simple mapping into a
$d$-dimensional space for
$d = O\left(\frac{1}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$
that approximately preserves both separability and margin. This mapping in fact
is just the following: we draw a set $S$ of $d$ examples from $D$, run
$K(x, y)$ over all pairs $x, y \in S$ to place $S$ exactly into $\mathbb{R}^d$,
and then for general $x \in X$ define $F(x)$ to be the orthogonal projection of
$\phi(x)$ down to this space (which can be computed using the kernel). That is,
this mapping can be viewed as an orthogonal projection of the $\phi$-space down
to the space spanned by $\phi(S)$. In Section 4, we give a more sophisticated
mapping to a space of dimension only
$O\left(\frac{1}{\gamma^2} \log \frac{1}{\varepsilon\delta}\right)$. This
logarithmic dependence then means we can set $\varepsilon$ small enough as a
function of the dimension and our input error parameter that we can then plug
in a generic zero-noise linear separator algorithm in the mapped space
(assuming the target function was perfectly separable with margin $\gamma$ in
the $\phi$-space). In Section 5 we argue that for a black-box kernel, one must
have access to the underlying distribution $D$ if one wishes to produce a good
mapping into a low-dimensional space. Finally, we give a short discussion in
Section 6.

An especially simple mapping: We also note that a corollary to one of our
results (Lemma 1) is that if we are willing to use dimension
$d = O\left(\frac{1}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln \frac{1}{\delta}\right]\right)$
and we are not concerned with preserving the margin and only want approximate
separability, then the following especially simple procedure suffices. Just
draw a random sample of $d$ unlabeled points $x_1, \dots, x_d$ and define
$F(x) = (K(x, x_1), \dots, K(x, x_d))$. That is, we define the $i$th "feature"
of $x$ to be $K(x, x_i)$. Then, with high probability, the data will be
approximately separable in this $d$-dimensional space if the target function
had margin $\gamma$ in the $\phi$-space. Thus, this gives a particularly simple
way of using the kernel and distribution for feature generation.
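
The especially simple mapping $F(x) = (K(x, x_1), \dots, K(x, x_d))$ can be sketched in a few lines of Python. The Gaussian kernel, the synthetic unlabeled sample and the value of `d` below are illustrative assumptions only; in the setting of the paper the kernel is an arbitrary black box and $d$ is chosen according to Lemma 1.

```python
import math
import random

def rbf_kernel(x, y, width=1.0):
    """Example black-box kernel (a Gaussian kernel, chosen only for illustration)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * width ** 2))

def make_feature_map(kernel, landmarks):
    """Return F with F(x) = (K(x, x_1), ..., K(x, x_d)) for the given unlabeled points."""
    def F(x):
        return [kernel(x, xi) for xi in landmarks]
    return F

rng = random.Random(0)
# Hypothetical unlabeled sample standing in for draws from the distribution D.
unlabeled = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
d = 10
landmarks = rng.sample(unlabeled, d)

F = make_feature_map(rbf_kernel, landmarks)
features = F([0.5, -0.25])   # a d-dimensional explicit representation
```

Any linear-separator learner can then be run on the vectors `F(x)`; note that the kernel is only ever called as a function of two inputs, never opened up.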



2 Notation and Definitions 

We assume that data is drawn from some distribution $D$ over an instance space
$X$ and labeled by some unknown target function $c : X \to \{-1, +1\}$. We use
$P$ to denote the combined distribution over labeled examples.

A kernel $K$ is a pairwise function $K(x, y)$ that can be viewed as a "legal"
definition of inner product. Specifically, there must exist a function $\phi$
mapping $X$ into a possibly high-dimensional Euclidean space such that
$K(x, y) = \phi(x) \cdot \phi(y)$. We call the range of $\phi$ the
"$\phi$-space", and use $\phi(D)$ to denote the induced distribution in the
$\phi$-space produced by choosing random $x$ from $D$ and then applying
$\phi(x)$.

We say that for a set $S$ of labeled examples, a vector $w$ in the
$\phi$-space has margin $\gamma$ if:

$$\min_{(x, \ell) \in S} \left[\ell \, \frac{w \cdot \phi(x)}{|w| |\phi(x)|}\right] \ge \gamma.$$

That is, $w$ has margin $\gamma$ if any labeled example in $S$ is correctly
classified by the linear separator $w \cdot \phi(x) \ge 0$, and furthermore the
cosine of the angle between $w$ and $\phi(x)$ has magnitude at least $\gamma$.
If such a vector $w$ exists, then we say that $S$ is linearly separable with
margin $\gamma$ under the kernel $K$. For simplicity, we are only considering
separators that pass through the origin, though our results can be adapted to
the general case as well.

We can similarly talk in terms of the distribution $P$ rather than a sample
$S$. We say that a vector $w$ in the $\phi$-space has margin $\gamma$ with
respect to $P$ if:

$$\Pr_{(x, \ell) \in P} \left[\ell \, \frac{w \cdot \phi(x)}{|w| |\phi(x)|} < \gamma\right] = 0.$$

If such a vector $w$ exists, then we say that $P$ is (perfectly) linearly
separable with margin $\gamma$ under $K$. One can also weaken the notion of
perfect separability. We say that a vector $w$ in the $\phi$-space has error
$\alpha$ at margin $\gamma$ if:

$$\Pr_{(x, \ell) \in P} \left[\ell \, \frac{w \cdot \phi(x)}{|w| |\phi(x)|} < \gamma\right] \le \alpha.$$
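
The two notions just defined are easy to compute on a finite sample. The Python sketch below assumes, purely for illustration, that the examples are already given as explicit vectors in the $\phi$-space (so $\phi$ is the identity); the function names and the toy sample are hypothetical.

```python
import math

def normalized_margins(w, sample):
    """Return label * (w . x) / (|w| |x|) for each labeled example (x, label).

    Points are assumed to be given explicitly in the phi-space already.
    """
    w_norm = math.sqrt(sum(c * c for c in w))
    out = []
    for x, label in sample:
        x_norm = math.sqrt(sum(c * c for c in x))
        dot = sum(a * b for a, b in zip(w, x))
        out.append(label * dot / (w_norm * x_norm))
    return out

def margin(w, sample):
    """Largest gamma such that w has margin gamma on the sample."""
    return min(normalized_margins(w, sample))

def error_at_margin(w, sample, gamma):
    """Fraction of examples whose normalized margin falls below gamma."""
    values = normalized_margins(w, sample)
    return sum(v < gamma for v in values) / len(values)

# Toy sample: three examples are separated by w, the fourth lies on the boundary.
sample = [([1.0, 0.0], +1), ([1.0, 1.0], +1), ([-1.0, 0.0], -1), ([0.0, 1.0], -1)]
w = [1.0, 0.0]
margin_first_three = margin(w, sample[:3])
error_half = error_at_margin(w, sample, 0.5)
```

Here `error_at_margin` is the empirical counterpart of the probability in the last definition, with the sample playing the role of $P$.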



Our starting assumption in this paper will be that $P$ is perfectly separable
with margin $\gamma$ under $K$, but we can also weaken the assumption to the existence
of a vector $w$ with error $\alpha$ at margin $\gamma$, with a corresponding weakening of
the implications. Our goal is a mapping $F : X \to R^d$, where $d$ is not too
large, that approximately preserves separability. We use $F(D)$ to denote the
induced distribution in $R^d$ produced by selecting points in $X$ from $D$ and then
applying $F$, and use $F(P) = F(D, c)$ to denote the induced distribution on
labeled examples.

For a set of vectors $v_1, v_2, \ldots, v_k$ in Euclidean space, let span$(v_1, \ldots, v_k)$
denote the span of these vectors: that is, the set of vectors $v$ that can be written
as a linear combination $a_1 v_1 + \ldots + a_k v_k$. Also, for a vector $v$ and a subspace
$Y$, let proj$(v, Y)$ be the orthogonal projection of $v$ down to $Y$. So, for instance,
proj$(v, \mathrm{span}(v_1, \ldots, v_k))$ is the orthogonal projection of $v$ down to the space
spanned by $v_1, \ldots, v_k$. We note that given a set of vectors $v_1, \ldots, v_k$ and the
ability to compute dot-products, this projection can be computed efficiently by
solving a system of linear equations.
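As a minimal sketch of this remark (not code from the paper), the projection coefficients can be obtained from dot products alone by solving the Gram system $G a = b$, with $G_{ij} = v_i \cdot v_j$ and $b_i = v_i \cdot v$. The helper names `dot`, `solve`, and `project` are illustrative, and the sketch assumes the $v_i$ are linearly independent so that $G$ is invertible.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve the small square system A x = b by Gaussian elimination
    with partial pivoting (assumes A is invertible)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def project(v, basis):
    """Orthogonal projection of v onto span(basis), using only dot products."""
    gram = [[dot(vi, vj) for vj in basis] for vi in basis]  # G[i][j] = v_i . v_j
    rhs = [dot(vi, v) for vi in basis]                      # b[i] = v_i . v
    coeffs = solve(gram, rhs)                               # a with G a = b
    return [sum(a * vi[t] for a, vi in zip(coeffs, basis)) for t in range(len(v))]


basis = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
v = [2.0, 3.0, 4.0]
p = project(v, basis)
residual = [vi - pi for vi, pi in zip(v, p)]
```

In a kernel setting the entries of the Gram matrix and of the right-hand side would be kernel evaluations, so the same computation goes through without ever forming $\phi(x)$ explicitly.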



3 A Simpler Mapping 

Our goal is a procedure that, given black-box access to a kernel function $K(\cdot,\cdot)$,
unlabeled examples from distribution $D$, and a margin value $\gamma$, produces a
(probability distribution over) mappings $F : X \to R^d$ such that if the target function
indeed has margin $\gamma$ in the $\phi$-space, then with high probability our mapping will
preserve approximate linear separability. In this section, we analyze a method
that produces a space of dimension
$O\left(\frac{1}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]\right)$,
where $\varepsilon$ is our bound on the error rate of the best separator in the mapped space.
We will, in fact, strengthen our goal somewhat to require that $F(P)$ be approximately
separable at margin $\gamma/2$ (rather than just approximately separable) so that we can use
this mapping as a first step in a better mapping in Section 4.

Informally, the method is just to draw a set $S$ of $d$ examples from $D$, and
then (using the kernel $K$) to define $F(x)$ so that it is equivalent to an orthogonal
projection of $\phi(x)$ down to the space spanned by $\phi(S)$.

The following lemma is key to our analysis.



Lemma 1. Consider any distribution over labeled examples in Euclidean space
such that there exists a vector $w$ with margin $\gamma$. Then if we draw

\[ n \ge \frac{8}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right] \]

examples $z_1, \ldots, z_n$ iid from this distribution, with probability $\ge 1 - \delta$, there
exists a vector $w'$ in span$(z_1, \ldots, z_n)$ that has error at most $\varepsilon$ at margin $\gamma/2$.



Proof. We give here two proofs of this lemma. The first (which produces a somewhat
worse bound on $n$) uses the machinery of margin bounds. Margin bounds
[12,3] tell us that using
$n = O\left(\frac{1}{\varepsilon}\left[\frac{1}{\gamma^2}\log^2\left(\frac{1}{\gamma\varepsilon}\right) + \log\frac{1}{\delta}\right]\right)$
points, with high probability, any separator with margin $\ge \gamma$ over the observed data
has a low true error rate. Thus, the projection of the target function $w$ into this space
will have a low error rate as well. (Projecting $w$ into this space maintains the value of
$w \cdot z_i$, while possibly shrinking the vector $w$, which can only increase the margin over
the observed data.) The only technical issue is that we want as a conclusion for
the separator not only to have a low error rate over the distribution, but also to
have a large margin. However, we can easily get this from the standard double-sample
argument. Specifically, rather than use a $\gamma/2$-cover as in the standard
margin bound, one can use a $\gamma/4$-cover. When the double sample is randomly
partitioned into $(S_1, S_2)$, it is unlikely that any member of this cover will have
zero error on $S_1$ at margin $3\gamma/4$, and yet substantial error on $S_2$ at the same
margin, which then implies that (since this is a $\gamma/4$-cover) no separator has zero
error on $S_1$ at margin $\gamma$ and yet substantial error on $S_2$ at margin $\gamma/2$.

However, we also note that since we are only asking for an existential statement
(the existence of $w'$), we do not need the full machinery of margin bounds,
and give a second more direct proof (with better bounds on $n$) from first principles.
For any set of points $S$, let $w_{in}(S)$ be the projection of $w$ to span$(S)$,
and let $w_{out}(S)$ be the orthogonal portion of $w$, so that $w = w_{in}(S) + w_{out}(S)$
and $w_{in}(S) \perp w_{out}(S)$. Also, for convenience, assume $w$ and all examples $z$
are unit-length vectors (since we have defined margins in terms of angles, we
can do this without loss of generality). Now, let us make the following definitions.
Say that $w_{out}(S)$ is large if $\Pr_z(|w_{out}(S) \cdot z| > \gamma/2) \ge \varepsilon$, and otherwise
say that $w_{out}(S)$ is small. Notice that if $w_{out}(S)$ is small, we are done, because
$w \cdot z = (w_{in}(S) \cdot z) + (w_{out}(S) \cdot z)$, so $w_{in}(S) \cdot z$ differs from $w \cdot z$ by at most
$\gamma/2$ except with probability less than $\varepsilon$, which means that $w_{in}(S)$ has the
properties we want. On the other hand, if $w_{out}(S)$ is large, this means that a
new random point $z$ has at least an $\varepsilon$ chance of improving the set $S$. Specifically,
consider $z$ such that $|w_{out}(S) \cdot z| > \gamma/2$. For $S' = S \cup \{z\}$, we have
$w_{out}(S') = w_{out}(S) - \mathrm{proj}(w_{out}(S), \mathrm{span}(S')) = w_{out}(S) - (w_{out}(S) \cdot z')z'$, where
$z' = (z - \mathrm{proj}(z, \mathrm{span}(S)))/|z - \mathrm{proj}(z, \mathrm{span}(S))|$ is the portion of $z$ orthogonal to
span$(S)$, stretched to be a unit vector. But since $|w_{out}(S) \cdot z'| \ge |w_{out}(S) \cdot z|$, this
implies that $|w_{out}(S')|^2 < |w_{out}(S)|^2 - (\gamma/2)^2$. Now, since $|w|^2 = |w_{out}(\emptyset)|^2 = 1$
and $|w_{out}(S)|^2$ can never become negative, this can happen at most $4/\gamma^2$ times.
So, we have a situation where so long as $w_{out}$ is large, each example has at least
an $\varepsilon$ chance of reducing $|w_{out}|^2$ by at least $\gamma^2/4$. This can happen at most $4/\gamma^2$
times, so Chernoff bounds imply that with probability at least $1 - \delta$, $w_{out}(S)$
will be small for $S$ a sample of size
$\ge \frac{8}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]$. □



Lemma 1 implies that if $P$ is linearly separable with margin $\gamma$ under $K$, and
we draw $n = \frac{8}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]$ random unlabeled
examples $x_1, \ldots, x_n$ from $D$, with probability at least $1 - \delta$ there is a separator
$w'$ in the $\phi$-space with error rate at most $\varepsilon$ that can be written as

\[ w' = \alpha_1 \phi(x_1) + \ldots + \alpha_n \phi(x_n). \]

Notice that since $w' \cdot \phi(x) = \alpha_1 K(x, x_1) + \ldots + \alpha_n K(x, x_n)$, an immediate
implication is that if we simply think of $K(x, x_i)$ as the $i$th "feature" of $x$, that is,
if we define $F(x) = (K(x, x_1), \ldots, K(x, x_n))$, then with high probability $F(P)$
will be approximately linearly separable as well. So, the kernel and distribution
together give us a particularly simple way of performing feature generation that
preserves (approximate) separability.

Unfortunately, the above mapping $F$ may not preserve margins because we
do not have a good bound on the length of the vector $(\alpha_1, \ldots, \alpha_n)$ defining the
separator in the new space. Instead, to preserve margin we want to perform an
orthogonal projection. Specifically, we draw a set $S = \{x_1, \ldots, x_n\}$ of
$\frac{8}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]$ unlabeled examples from $D$ and run
$K(x, y)$ for all pairs $x, y \in S$. Let $M(S) = (K(x_i, x_j))_{x_i, x_j \in S}$ be the resulting
kernel matrix. We use $M(S)$ to define an embedding of $S$ into $R^n$ by Cholesky
factorization. More specifically, we decompose $M(S)$ into $M(S) = U^T U$, where $U$ is
an upper triangular matrix, and we define our mapping $F(x_j)$ to be the $j$th column of $U$.

We next extend the embedding to all of $X$ by considering $F : X \to R^n$ to
be a mapping defined as follows: for $x \in X$, let $F(x) \in R^n$ be the point such
that $F(x) \cdot F(x_i) = K(x, x_i)$, for all $i \in \{1, \ldots, n\}$. In other words, this mapping
is equivalent to orthogonally projecting $\phi(x)$ down to span$(\phi(x_1), \ldots, \phi(x_n))$.
We can compute $F(x)$ by solving the system of linear equations
$[F(x)]^T U = (K(x, x_1), \ldots, K(x, x_n))$.
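The construction just described can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the names `cholesky_upper`, `make_mapping`, and `gaussian_kernel` are hypothetical, the Gaussian kernel is only a stand-in for the black-box $K$, and the sketch assumes the kernel matrix on the sample is strictly positive definite.

```python
import math


def gaussian_kernel(x, y):
    """Stand-in kernel (an assumption for this example): exp(-|x - y|^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))


def cholesky_upper(M):
    """Return upper-triangular U with M = U^T U (M symmetric positive definite)."""
    n = len(M)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        U[i][i] = math.sqrt(M[i][i] - sum(U[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            U[i][j] = (M[i][j] - sum(U[k][i] * U[k][j] for k in range(i))) / U[i][i]
    return U


def make_mapping(kernel, sample):
    """Build F so that F(x) . F(x_i) = K(x, x_i) for every sample point x_i."""
    M = [[kernel(a, b) for b in sample] for a in sample]  # kernel matrix M(S)
    U = cholesky_upper(M)
    n = len(sample)

    def F(x):
        k = [kernel(x, s) for s in sample]
        f = [0.0] * n
        # Forward substitution on U^T f = k, i.e. [F(x)]^T U = (K(x, x_1), ...).
        for i in range(n):
            f[i] = (k[i] - sum(U[j][i] * f[j] for j in range(i))) / U[i][i]
        return f

    return U, F


sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
U, F = make_mapping(gaussian_kernel, sample)
x_new = (0.5, 0.5)
fx = F(x_new)
```

Applied to a sample point $x_j$, the solve returns the $j$th column of $U$, so the extension agrees with the embedding of $S$ defined by the factorization.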

We now claim that by Lemma 1, this mapping $F$ maintains approximate
separability at margin $\gamma/2$.

Theorem 1. Given $\varepsilon, \delta, \gamma < 1$, if $P$ has margin $\gamma$ in the $\phi$-space, then with
probability $\ge 1 - \delta$ our mapping $F$ (into the space of dimension $n$) has the
property that $F(P)$ is linearly separable with error at most $\varepsilon$ at margin $\gamma/2$,
given that we use
$n \ge \frac{8}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]$
unlabeled examples.



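Proof. Since $\phi(D)$ is separable at margin $\gamma$, it follows from Lemma 1 that, for
$n \ge \frac{8}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]$, with probability at least $1 - \delta$,
there exists a vector $w$ that can be written as
$w = \alpha_1 \phi(x_1) + \ldots + \alpha_n \phi(x_n)$, that has error at most $\varepsilon$ at margin
$\gamma/2$ (with respect to $\phi(P)$), i.e.,

\[ \Pr_{(x,\ell) \in P}\left[ \frac{\ell\,(w \cdot \phi(x))}{|w|\,|\phi(x)|} < \frac{\gamma}{2} \right] \le \varepsilon. \]

Consider $\tilde{w} \in R^n$, $\tilde{w} = \alpha_1 F(x_1) + \ldots + \alpha_n F(x_n)$. Since $|\tilde{w}| = |w|$ and since
$w \cdot \phi(x) = \tilde{w} \cdot F(x)$ and $|F(x)| \le |\phi(x)|$ for every $x \in X$, we get that $\tilde{w}$ has error
at most $\varepsilon$ at margin $\gamma/2$ (with respect to $F(P)$), i.e.,

\[ \Pr_{(x,\ell) \in P}\left[ \frac{\ell\,(\tilde{w} \cdot F(x))}{|\tilde{w}|\,|F(x)|} < \frac{\gamma}{2} \right] \le \varepsilon. \]

Therefore, for our choice of $n$, with probability at least $1 - \delta$ (over randomization
in our choice of $F$), there exists a vector $\tilde{w} \in R^n$ that has error at most $\varepsilon$ at
margin $\gamma/2$ with respect to $F(P)$. □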



Notice that the running time to compute $F(x)$ is polynomial in $1/\gamma$, $1/\varepsilon$, $1/\delta$
and the time to compute the kernel function $K$.



4 An Improved Mapping 

We now describe an improved mapping, in which the dimension d has only a 
logarithmic, rather than linear, dependence on 1/e. The idea is to perform a 
two-stage process, composing the mapping from the previous section with a ran- 
dom linear projection from the range of that mapping down to the desired space. 
Thus, this mapping can be thought of as combining two types of random projec- 
tion: a projection based on points chosen at random from D, and a projection 
based on choosing points uniformly at random in the intermediate space. 

We begin by stating a result from [2] that we will use. Here $N(0, 1)$ is the
standard Normal distribution with mean 0 and variance 1 and $U(-1, 1)$ is the
distribution that has probability 1/2 on $-1$ and probability 1/2 on 1.

Theorem 2 (Neuronal RP [2]). Let $u, v \in R^n$. Let $u' = \frac{1}{\sqrt{d}} A u$ and
$v' = \frac{1}{\sqrt{d}} A v$, where $A$ is a $d \times n$ random matrix whose entries are chosen
independently from either $N(0, 1)$ or $U(-1, 1)$. Then,

\[ \Pr_A\left[ (1 - \varepsilon)|u - v|^2 \le |u' - v'|^2 \le (1 + \varepsilon)|u - v|^2 \right] \ge 1 - 2e^{-(\varepsilon^2 - \varepsilon^3)\frac{d}{4}}. \]
Let $F_1 : X \to R^n$ be the mapping from Section 3, with $\varepsilon/2$ and $\delta/2$ as its
error and confidence parameters respectively. Let $F_2 : R^n \to R^d$ be a random
projection as in Theorem 2. Specifically, we pick $A$ to be a random $d \times n$ matrix
whose entries are chosen i.i.d. $N(0, 1)$ or $U(-1, 1)$ (i.e., uniformly from $\{-1, 1\}$).
We then set $F_2(x) = \frac{1}{\sqrt{d}} A x$. We finally consider our overall mapping $F : X \to R^d$
to be $F(x) = F_2(F_1(x))$.
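The second stage is simple enough to sketch directly. The following is a minimal illustration under stated assumptions, not the authors' code: the name `random_projection` and its parameters are hypothetical, and the input vectors stand in for outputs of the first-stage mapping from Section 3.

```python
import math
import random


def random_projection(n, d, rng, gaussian=True):
    """Return F2(x) = (1 / sqrt(d)) * A x for a random d x n matrix A whose
    entries are i.i.d. N(0, 1) (gaussian=True) or uniform on {-1, +1}."""
    A = [
        [rng.gauss(0.0, 1.0) if gaussian else rng.choice((-1.0, 1.0)) for _ in range(n)]
        for _ in range(d)
    ]
    scale = 1.0 / math.sqrt(d)

    def F2(x):
        return [scale * sum(a * xi for a, xi in zip(row, x)) for row in A]

    return F2


rng = random.Random(0)
F2_sign = random_projection(n=5, d=8, rng=rng, gaussian=False)
e0 = [1.0, 0.0, 0.0, 0.0, 0.0]
image = F2_sign(e0)
sq_norm = sum(t * t for t in image)
```

With $\pm 1$ entries, each coordinate vector is mapped to a vector of squared length exactly 1, which gives a quick sanity check of the $1/\sqrt{d}$ scaling; for general vectors, lengths are preserved only approximately and with high probability, as Theorem 2 states.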



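We now claim that for
$n = O\left(\frac{1}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]\right)$ and
$d = O\left(\frac{1}{\gamma^2}\log\left(\frac{1}{\varepsilon\delta}\right)\right)$, with
high probability, this mapping has the desired properties. The basic argument is
that the initial mapping $F_1$ maintains approximate separability at margin $\gamma/2$ by
Lemma 1, and then the second mapping approximately preserves this property
by Theorem 2.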



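Theorem 3. Given $\varepsilon, \delta, \gamma < 1$, if $P$ has margin $\gamma$ in the $\phi$-space, then with
probability at least $1 - \delta$, our mapping into the space of dimension
$d = O\left(\frac{1}{\gamma^2}\log\left(\frac{1}{\varepsilon\delta}\right)\right)$
has the property that $F(P)$ is linearly separable with error at most $\varepsilon$ at margin
$\gamma/4$, given that we use
$n = O\left(\frac{1}{\varepsilon}\left[\frac{1}{\gamma^2} + \ln\frac{1}{\delta}\right]\right)$
unlabeled examples.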



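Proof. By Lemma 1, with probability at least $1 - \delta/2$ there exists a separator
$w$ in the intermediate space $R^n$ with error at most $\varepsilon/2$ at margin $\gamma/2$. Let us
assume this in fact occurs. Now, consider some point $x \in R^n$. Theorem 2 implies
that under the random projection $F_2$, with high probability the lengths of $w$,
$x$, and $w - x$ are all approximately preserved, which implies that the cosine
of the angle between $w$ and $x$ (i.e., the margin of $x$ with respect to $w$) is also
approximately preserved. Specifically, for
$d = O\left(\frac{1}{\gamma^2}\log\left(\frac{1}{\varepsilon\delta}\right)\right)$, we have:

\[ \text{For all } x, \quad \Pr_A\left[ \left| \frac{w \cdot x}{|w|\,|x|} - \frac{F_2(w) \cdot F_2(x)}{|F_2(w)|\,|F_2(x)|} \right| \ge \gamma/4 \right] \le \varepsilon\delta/4. \]

This implies

\[ \Pr_{x \in F_1(D),\, A}\left[ \left| \frac{w \cdot x}{|w|\,|x|} - \frac{F_2(w) \cdot F_2(x)}{|F_2(w)|\,|F_2(x)|} \right| \ge \gamma/4 \right] \le \varepsilon\delta/4, \]

which implies (by Markov's inequality) that

\[ \Pr_A\left[ \Pr_{x \in F_1(D)}\left[ \left| \frac{w \cdot x}{|w|\,|x|} - \frac{F_2(w) \cdot F_2(x)}{|F_2(w)|\,|F_2(x)|} \right| \ge \gamma/4 \right] \ge \varepsilon/2 \right] \le \delta/2. \]

Since $w$ has error $\le \varepsilon/2$ at margin $\gamma/2$, this then implies that the probability
that $F_2(w)$ has error more than $\varepsilon$ over $F_2(F_1(D))$ at margin $\gamma/4$ is at most $\delta/2$.
Combining this with the $\delta/2$ failure probability of $F_1$ completes the proof. □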



As before, the running time to compute our mappings is polynomial in
$1/\gamma$, $1/\varepsilon$, $1/\delta$ and the time to compute the kernel function $K$.

Corollary 1. Given $\varepsilon', \delta, \gamma < 1$, if $P$ has margin $\gamma$ in the $\phi$-space then we
can use $n = O(1/(\varepsilon'\gamma^2))$ unlabeled examples to produce a mapping into $R^d$ for
$d = O\left(\frac{1}{\gamma^2}\log\frac{1}{\varepsilon'\gamma\delta}\right)$, that with probability $1 - \delta$ has the
property that $F(P)$ is linearly separable with error $\ll \varepsilon'/d$.

Proof. Just plug in the desired error rate into the bounds of Theorem 3. □

Note that we can set the error rate in Corollary 1 so that with high probability
a random labeled set of size $O(d/\varepsilon')$ will be linearly separable, and therefore any
linear separator will have low error by standard VC-dimension arguments. Thus,
we can apply an arbitrary linear-separator learning algorithm in $R^d$ to learn the
target concept.

5 On the Necessity of Access to D 

Our main algorithm constructs a mapping $F : X \to R^d$ using black-box access
to the kernel function $K(x, y)$ together with unlabeled examples from the input
distribution $D$. It is natural to ask whether it might be possible to remove the
need for access to $D$. In particular, notice that the mapping resulting from the
Johnson-Lindenstrauss lemma has nothing to do with the input distribution:
if we have access to the $\phi$-space, then no matter what the distribution is, a
random projection down to $R^d$ will approximately preserve the existence of a
large-margin separator with high probability. So perhaps such a mapping $F$ can
be produced by just computing $K$ on some polynomial number of cleverly-chosen
(or uniform random) points in $X$.$^1$ In this section, we give an argument showing
why this may not be possible for an arbitrary kernel. This leaves open, however,
the case of specific natural kernels.

In particular, consider $X = \{0, 1\}^n$, let $X'$ be a random subset of $2^{n/2}$
elements of $X$, and let $D$ be the uniform distribution on $X'$. For a given target
function $c$, we will define a special $\phi$-function $\phi_c$ such that $c$ is a large margin
separator in the $\phi$-space under distribution $D$, but that only the points in $X'$
behave nicely, and points not in $X'$ provide no useful information. Specifically,
consider $\phi_c : X \to R^2$ defined as:

\[ \phi_c(x) = \begin{cases} (1, 0) & \text{if } x \notin X' \\ (-1/2, \sqrt{3}/2) & \text{if } x \in X' \text{ and } c(x) = 1 \\ (-1/2, -\sqrt{3}/2) & \text{if } x \in X' \text{ and } c(x) = -1 \end{cases} \]

See Figure 1. This then induces the kernel:

\[ K_c(x, y) = \begin{cases} 1 & \text{if } x, y \notin X' \text{ or } [x, y \in X' \text{ and } c(x) = c(y)] \\ -1/2 & \text{otherwise} \end{cases} \]

Notice that the distribution $P = (D, c)$ over labeled examples has margin $\gamma = 1/2$
in the $\phi$-space.
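As a small sanity check of the two definitions above, the following sketch evaluates the dot products of the three possible images of $\phi_c$ on a toy instance. The function names, the tiny domain, and the parity target are assumptions chosen only for illustration; they are not part of the construction in the text.

```python
import math


def phi_c(x, x_prime, c):
    """The special phi-function: points outside X' map to (1, 0); points in X'
    map to one of two unit vectors depending on the label c(x)."""
    if x not in x_prime:
        return (1.0, 0.0)
    s = math.sqrt(3.0) / 2.0
    return (-0.5, s) if c(x) == 1 else (-0.5, -s)


def kernel_c(x, y, x_prime, c):
    """Kernel induced by phi_c: the ordinary dot product in R^2."""
    px, py = phi_c(x, x_prime, c), phi_c(y, x_prime, c)
    return px[0] * py[0] + px[1] * py[1]


def parity_label(x):
    """Toy target function (an assumption for this example): +1 for odd parity."""
    return 1 if sum(x) % 2 == 1 else -1


x_prime = {(0, 0, 1), (1, 1, 0)}   # toy stand-in for the random subset X'
inside_pos = (0, 0, 1)             # in X', label +1
inside_neg = (1, 1, 0)             # in X', label -1
outside_a, outside_b = (0, 0, 0), (1, 1, 1)

k_out_out = kernel_c(outside_a, outside_b, x_prime, parity_label)
k_out_in = kernel_c(outside_a, inside_pos, x_prime, parity_label)
k_in_same = kernel_c(inside_pos, inside_pos, x_prime, parity_label)
k_in_diff = kernel_c(inside_pos, inside_neg, x_prime, parity_label)
```

The four values illustrate why queries are uninformative: any pair with at least one point outside $X'$ returns either 1 or $-1/2$ regardless of the target function $c$.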

Now, consider any algorithm with black-box access to $K$ attempting to create
a mapping $F : X \to R^d$. Since $X'$ is a random exponentially-small fraction of $X$,
with high probability all calls made to $K$ return the value 1. Furthermore, even
though at "runtime" when $x$ is chosen from $D$, the function $F(x)$ may itself call
$K(x, y)$ for different previously-computed points $y$, with high probability these
will all give $K(x, y) = -1/2$. In particular, this means that the mapping $F$ is
with high probability independent of the target function $c$. Now, since $X'$ has
size $2^{n/2}$, there are exponentially many orthogonal functions $c$ over $D$, which
means that with high probability $F(D, c)$ will not even be weakly separable for
a random function $c$ over $X'$ unless $d$ is exponentially large in $n$.

Fig. 1. Function $\phi_c$ used in lower bound.

$^1$ Let's assume $X$ is a "nice" space such as the unit ball or $\{0, 1\}^n$.

Notice that the kernel in the above argument is positive semidefinite. If we
wish to have a positive definite kernel, we can simply change "1" to "$1 - \alpha$" and
"$-1/2$" to "$-\frac{1}{2}(1 - \alpha)$" in the definition of $K(x, y)$, except for $y = x$ in which
case we keep $K(x, y) = 1$. This corresponds to a function $\phi$ in which, rather
than mapping points exactly into $R^2$, we map into $R^{2+|X|}$, giving each example a
$\sqrt{\alpha}$-component in its own dimension, and we scale the first two components by
$\sqrt{1 - \alpha}$ to keep $\phi_c(x)$ a unit vector. The margin now becomes $\frac{1}{2}(1 - \alpha)$. Since the
modifications provide no real change (an algorithm with access to the original
kernel can simulate this one), the above arguments apply to this kernel as well.

Of course, these kernels are extremely unnatural, each with its own hidden 
target function built in. It seems quite conceivable that positive results indepen- 
dent of the distribution D can be achieved for standard, natural kernels. 

6 Discussion and Open Problems 

Our results show that given black-box access to a kernel function K and a distri- 
bution D (i.e., unlabeled examples) we can use K and D together to construct a 
new low-dimensional feature space in which to place our data that approximately 
preserves the desired properties of the kernel. 

We note that if one has an unkernelized algorithm for learning linear separators
with good margin-based sample-complexity bounds, then one does not
necessarily need to perform a mapping first and instead can apply a more direct
method. Specifically, draw a sufficiently large labeled set $S$ as required by the
algorithm's sample-complexity requirements, compute the kernel matrix $K(x, y)$
to place $S$ into $R^{|S|}$, and use the learning algorithm to find a separator $h$ in that
space. New examples can be projected into that space using the kernel function
(as in Section 3) and classified by $h$. Thus, our result is perhaps of more
interest from a conceptual point of view, or something we could apply if one
had a generic (e.g., non-margin-dependent) linear separator algorithm.

One aspect that we find conceptually interesting is the relation of the two
types of "random" mappings used in our approach. On the one hand, we have
mappings based on random examples drawn from $D$, and on the other hand we
have mappings based on uniform (or Gaussian) random vectors in the $\phi$-space
as in the Johnson-Lindenstrauss lemma.

Our main open question is whether, for natural standard kernel functions,
one can produce mappings $F : X \to R^d$ in an oblivious manner, without using
examples from the data distribution. The Johnson-Lindenstrauss lemma tells us
that such mappings exist, but the goal is to produce them without explicitly
computing the $\phi$-function.
Abstract. In the field of pattern recognition or outlier detection, it is
necessary to estimate the region where data of a particular class are gen- 
erated. In other words, it is required to accurately estimate the support of 
the distribution that generates the data. Considering the 1-dimensional 
distribution whose support is a finite interval, the data region is esti- 
mated effectively by the maximum value and the minimum value in the 
samples. Limiting distributions of these values have been studied in the 
extreme-value theory in statistics. In this research, we propose a method 
to estimate the data region using the maximum value and the minimum 
value in the samples. We calculate the average loss of the estimator, and 
derive the optimally improved estimators for given loss functions. 



1 Introduction 

In the field of pattern recognition or outlier detection, it is often necessary to
estimate the region where data of a particular class are generated. By learning a
boundary of the data region from examples of the target class only, one can
detect whether a new input belongs to the target class or is, in fact, unknown.

Different methods have been developed to estimate the data region. Most
often the probability density of the data is estimated using Parzen density
estimation or a Gaussian mixture model [3]. In these methods, the region where the
density exceeds some threshold is taken as the estimate. However, choosing the
threshold is a persistent difficulty for these techniques. Assuming the data are
generated from a probability distribution that has a compact support, a more precise
prediction is obtained by estimating the support of the distribution more accurately.
For this purpose, some methods inspired by support vector machines have been
proposed [4] [5].

However, in a parametric model of distribution whose support is a finite
interval and depends on a parameter, the log-likelihood of the model diverges.
Hence, the regularity conditions of statistical estimation do not hold. The
properties of such non-regular models have been clarified in some studies. It has been
known that they have quite different properties from those of regular statistical
models [2]. In estimating the support of the 1-dimensional distribution, the
maximum and minimum values in the samples are used effectively. The properties of
these values in the samples taken from a distribution have been studied in the
theory of extreme-value statistics [1] [6].

In this research, we propose a new method to estimate the data region us- 
ing the maximum and minimum values in the samples. We derive the asymp- 
totic distributions of the maximum and minimum values based on the theory 
of extreme-value statistics. By calculating the average loss of the estimator, the 
optimally improved estimators for given loss functions are derived. 

In Section 2, the general theory of extreme-value statistics is introduced. In
Section 3, we propose a method to estimate the data region and derive the aver- 
age loss of the method. This method is derived by assuming some conditions on 
the distribution that generates the data. Therefore, we need to estimate some 
parameters of the distribution in order to use this method without any infor- 
mation of these parameters. In Section 4, a method to estimate these additional 
parameters is given. The efficiency of this method is demonstrated by simulation 
in Section 5. 

The proposed method can also be applied to estimate a multi-dimensional 
data region by taking a 1-dimensional data set from multi-dimensional data. 
Using some mappings, some sets of 1-dimensional data are obtained. Then, a 
multi-dimensional data region is estimated by applying the proposed method 
to each 1-dimensional data set. In the latter half of Section 5, we present an 
experimental study on 2-dimensional data and demonstrate experimentally that 
the proposed method works effectively to estimate a 2-dimensional data region. 
Discussion and conclusions are given in Section 6. The proofs of the theorems in
Section 3 are given in the Appendix.

2 Extreme-Value Statistics

Suppose that $X^n = \{X_1, X_2, \ldots, X_n\}$ is a set of $n$ samples independently and
identically taken from a probability distribution with a density function $f(x)$
and a distribution function $F(x)$. Putting these samples in the order so that

\[ X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}, \]

the random variable defined by the $k$th smallest value $X_{(k)}$ $(1 \le k \le n)$ is
called the order statistic. In particular, the extremely large or small variables such
as $\max_{1 \le i \le n} X_i$ and $\min_{1 \le i \le n} X_i$ are called extreme-value statistics. In this
section, we discuss the maximum value $M_n = \max_{1 \le i \le n} X_i$.

The distribution function of $M_n$ is given by

\[ P\{M_n \le x\} = P\{X_1 \le x, \ldots, X_n \le x\} = F(x)^n. \]

Differentiating this distribution function, we obtain the density function $f_{\max}(x)$
of $M_n$ as follows.

\[ f_{\max}(x) = n f(x) F(x)^{n-1}. \]
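To make the two formulas concrete, here is a short numerical sketch for the uniform distribution on $[0, 1]$, where $F(x) = x$ and $f(x) = 1$. The function names and the choice $n = 10$ are assumptions made only for this illustration.

```python
def cdf_of_max(F, n):
    """Distribution function of the maximum of n i.i.d. samples: F(x)^n."""
    return lambda x: F(x) ** n


def pdf_of_max(f, F, n):
    """Density of the maximum: n f(x) F(x)^(n-1)."""
    return lambda x: n * f(x) * F(x) ** (n - 1)


def uniform_cdf(x):
    """F(x) for the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


def uniform_pdf(x):
    """f(x) for the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


n = 10
G = cdf_of_max(uniform_cdf, n)
g = pdf_of_max(uniform_pdf, uniform_cdf, n)

# A central finite difference of the distribution function should match the density.
h = 1e-5
numeric_density = (G(0.9 + h) - G(0.9 - h)) / (2 * h)
analytic_density = g(0.9)
```

The density concentrates near the right endpoint as $n$ grows, which is why the sample maximum is a natural starting point for estimating the upper end of the support.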
The following property of the maximum value $M_n = \max_{1 \le i \le n} X_i$ is well
known on its asymptotic distribution function $F(x)^n$ as $n \to \infty$ [1] [6].

If, for some sequences $a_n (> 0)$ and $b_n$, the distribution function of

\[ \frac{M_n - b_n}{a_n} \]

has an asymptotic distribution function $G(x)$, then $G(x)$ should have one of the
following three forms.

1. \[ G(x) = \exp(-e^{-x}) \quad (-\infty < x < \infty), \]

2. \[ G(x) = \begin{cases} 0 & (x \le 0), \\ \exp(-x^{-\alpha}) & (x > 0), \end{cases} \]

3. \[ G(x) = \begin{cases} \exp(-(-x)^{\alpha}) & (x < 0), \\ 1 & (x \ge 0), \end{cases} \]

for some $\alpha > 0$.

The minimum value $\min_{1 \le i \le n} X_i$ also has the same property and the
corresponding three forms of distribution are known.



3 Estimation of the Data Region 

We assume the following conditions on the density function $f(x)$.

(i) \[ \begin{cases} f(x) > 0 & (b < x < a), \\ f(x) = 0 & (\text{otherwise}). \end{cases} \]

(ii) For $0 < \alpha, \beta < \infty$ and $0 < A, B < \infty$,

\[ \lim_{x \to a-0} (a - x)^{1-\alpha} f(x) = A, \]

\[ \lim_{x \to b+0} (x - b)^{1-\beta} f(x) = B. \]

The condition (i) means that the support of the distribution is a finite interval
$[b, a]$. The condition (ii) means the behaviors of $f(x)$ around the endpoints $a$
and $b$ are determined by the parameters $\alpha, A$ and $\beta, B$ respectively. We see
that in the neighborhood of $a$, $f(x)$ diverges for $0 < \alpha < 1$ and converges to
the positive number $A$ for $\alpha = 1$ or to 0 for $\alpha > 1$, and the same holds for
the parameter $\beta$ around $b$. For example, when $f(x)$ is the Beta distribution on
the interval $[0, 1]$, that is,
$f(x) = \frac{1}{B(p_1, p_2)} x^{p_2 - 1}(1 - x)^{p_1 - 1}$ where
$B(p_1, p_2) = \int_0^1 x^{p_2 - 1}(1 - x)^{p_1 - 1}\,dx$, these conditions are satisfied with

\[ a = 1, \quad b = 0, \quad \alpha = p_1, \quad \beta = p_2, \quad A = B = \frac{1}{B(p_1, p_2)}. \]
Under the above conditions, we estimate $a$ and $b$, the endpoints of the density
$f(x)$, by the estimators $\hat{a}, \hat{b}$. We define the estimators $\hat{a}, \hat{b}$ using the maximum
and minimum values in the samples,

\[ \hat{a} = \max_{1 \le i \le n} X_i + \frac{c_a}{n^{1/\alpha}}, \qquad (1) \]

\[ \hat{b} = \min_{1 \le i \le n} X_i - \frac{c_b}{n^{1/\beta}}, \qquad (2) \]

where $c_a$ and $c_b$ are the coefficients that determine the correction from the
maximum and minimum values respectively. The average loss of these estimators is
defined later. Then we calculate the average loss and derive the optimal $c_a$ and
$c_b$ that minimize the loss. Before defining the average loss of $\hat{a}$ and $\hat{b}$, let us
consider the asymptotic distributions of the maximum and minimum values in
the samples that determine the properties of the estimators $\hat{a}$ and $\hat{b}$. The following
Theorem 1 and Theorem 2 are obtained.



Theorem 1. For $M_n = \max_{1 \le i \le n} X_i$ and $m_n = \min_{1 \le i \le n} X_i$, the asymptotic
distributions of $(\frac{A}{\alpha} n)^{1/\alpha}(M_n - a)$ and $(\frac{B}{\beta} n)^{1/\beta}(m_n - b)$ have the following
distribution functions $G_{\max}(x)$ and $G_{\min}(x)$,

\[ G_{\max}(x) = \begin{cases} \exp(-(-x)^{\alpha}) & (x < 0), \\ 1 & (x \ge 0), \end{cases} \]

\[ G_{\min}(x) = \begin{cases} 0 & (x < 0), \\ 1 - \exp(-x^{\beta}) & (x \ge 0). \end{cases} \]

Theorem 2. Denoting by $G_n(s, t)$ the joint distribution function of
$\left((\frac{A}{\alpha} n)^{1/\alpha}(M_n - a),\ (\frac{B}{\beta} n)^{1/\beta}(m_n - b)\right)$,

\[ \lim_{n \to \infty} G_n(s, t) = G_{\max}(s)\,G_{\min}(t). \]

From Theorem 2 it is noted that the maximum $M_n$ and minimum $m_n$ are
asymptotically independent of each other. Therefore we consider the estimation
of each endpoint separately. More specifically, we put the left endpoint $b = 0$
and consider estimating only the right endpoint $a > 0$ hereafter. The estimator
$\hat{b}$ of the left endpoint $b$ and the coefficient $c_b$ are determined by the estimator $\hat{a}$
and coefficient $c_a$ derived from the set of samples $\{-X_1, \ldots, -X_n\}$.

We define the average loss of the estimator $\hat{a}$ of eq.(1) as

\[ E_{X^n}\left[ U(|a - \hat{a}|) \right], \qquad (3) \]

where $E_{X^n}[\cdot]$ denotes the expectation value over all sets of samples and the
function $U(x)$ is an arbitrary analytic function which satisfies $U(0) = 0$ and
$U(x) \ge 0$.

In order to evaluate the average loss eq.(3), we prove the following Theorem 3
and Theorem 4.
Fig. 1. Probability density of $T = n^{1/\alpha}|\hat{a} - a|$ ($\alpha = 1, 2, 3$)

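Theorem 3. Denoting by $H_n(t)$ the distribution function of $n^{k/\alpha}(\hat{a} - a)^k$ for
any odd number $k > 0$ and by $\tilde{H}_n(t)$ the distribution function of $n^{k/\alpha}|\hat{a} - a|^k$ for
any natural number $k$,

\[ \lim_{n \to \infty} H_n(t) = \begin{cases} 1 & (t \ge c_a^k), \\ \exp\left(-\frac{A}{\alpha}(c_a - t^{1/k})^{\alpha}\right) & (t < c_a^k), \end{cases} \qquad (4) \]

and

\[ \lim_{n \to \infty} \tilde{H}_n(t) = \begin{cases} 1 - \exp\left(-\frac{A}{\alpha}(c_a + t^{1/k})^{\alpha}\right) & (t \ge c_a^k), \\ \exp\left(-\frac{A}{\alpha}(c_a - t^{1/k})^{\alpha}\right) - \exp\left(-\frac{A}{\alpha}(c_a + t^{1/k})^{\alpha}\right) & (0 \le t < c_a^k). \end{cases} \qquad (5) \]

If $c_a = 0$, the above asymptotic distribution functions are Weibull distribution
functions.

By differentiating the above distribution functions eq.(5), we obtain the
asymptotic density functions of $n^{k/\alpha}|\hat{a} - a|^k$. We illustrate them in Fig. 1 for
$k = 1$, $\alpha = 1, 2, 3$, and in Fig. 2 for $k = 2$, $\alpha = 1, 2, 3$. In these figures the values
of the coefficients $c_a$ are set to the optimal values described below.

Evaluating the expectation values of $n^{k/\alpha}(\hat{a} - a)^k$ and $n^{k/\alpha}|\hat{a} - a|^k$, we obtain
the following theorem.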

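Theorem 4. For an arbitrary natural number $k$,

\[ \lim_{n \to \infty} E_{X^n}\left[ n^{k/\alpha}(\hat{a} - a)^k \right] = \sum_{l=0}^{k} \binom{k}{l} c_a^{k-l} (-1)^l \left(\frac{\alpha}{A}\right)^{l/\alpha} \Gamma\left(\frac{l}{\alpha} + 1\right), \]

and if $k$ is an odd number,

\[ \lim_{n \to \infty} E_{X^n}\left[ n^{k/\alpha}|\hat{a} - a|^k \right] = \sum_{l=0}^{k} \binom{k}{l} c_a^{k-l} (-1)^l \left(\frac{\alpha}{A}\right)^{l/\alpha} \left\{ 2\gamma\left(\frac{l}{\alpha} + 1, \frac{A}{\alpha} c_a^{\alpha}\right) - \Gamma\left(\frac{l}{\alpha} + 1\right) \right\}, \]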
where $\gamma(x, p)$ and $\Gamma(x)$ are respectively the incomplete gamma function and the
gamma function. The definitions of these functions are

\[ \gamma(x, p) = \int_0^p t^{x-1} e^{-t}\,dt, \qquad \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt. \]

Fig. 2. Probability density of $T = n^{2/\alpha}|\hat{a} - a|^2$ ($\alpha = 1, 2, 3$)

From Theorem 4 we obtain the following two corollaries on the average loss 
eq.(3). 

Corollary 1. If ai = yf 0, 



E 



Xn 



U{\a-d\) 



ai 

„l/a 






Denoting the optimal coefficient $c_a$ that minimizes this average loss by $c_a^*$,

\[ c_a^* = \left(\frac{\alpha}{A}\log 2\right)^{1/\alpha}. \]



Proof. Since

\[ U(|a - \hat{a}|) = a_1|a - \hat{a}| + O(|a - \hat{a}|^2), \]

put $k = 1$ in Theorem 4. We obtain $c_a^*$ by differentiating the average loss with
respect to $c_a$ using

\[ \frac{\partial \gamma(x, p)}{\partial p} = p^{x-1} e^{-p}. \]

□



Corollary 2. If ai = = 0 and 02 = | ^ 0, 



E 



X” 






and 
Proof. Since

\[ U(|a - \hat{a}|) = a_2|a - \hat{a}|^2 + O(|a - \hat{a}|^3), \]

put $k = 2$ in Theorem 4. □



4 Estimation of the Other Parameters 

In the previous section, the optimal estimators were derived under the conditions
(i) and (ii) with parameters $\alpha, \beta, A, B$. These parameters determine the behavior
of the density function $f(x)$ around the endpoints. If we estimate the data region
without any information on these parameters, we need to estimate them using
samples. In this section, we consider the case where only the right endpoint is
estimated and propose a method to estimate the parameters $\alpha$ and $A$ of condition (ii)
in the previous section.

The asymptotic distribution of the maximum value in the $k$ samples is
uniquely specified if parameters $\alpha$, $A$ and endpoint $a$ are given, and its density
function $p_k(x|\alpha, A, a)$ is given by

\[ p_k(x|\alpha, A, a) = Ak(a - x)^{\alpha - 1} \exp\left(-\frac{Ak}{\alpha}(a - x)^{\alpha}\right), \qquad (6) \]

from Theorem 1 in the previous section.

Suppose that $m$ samples $x_1, x_2, \ldots, x_m$ are independently and identically
taken from the above distribution. The log-likelihood function $L(\alpha, A, a)$ for the
$m$ samples is given by

\[ L(\alpha, A, a) = \sum_{i=1}^{m} \left\{ \log Ak + (\alpha - 1)\log(a - x_i) - \frac{Ak}{\alpha}(a - x_i)^{\alpha} \right\}. \qquad (7) \]

By dividing $n$ samples taken from $f(x)$ into $m$ groups and taking the maximum
value in each group, the samples of maximum values are obtained. Assuming
these are taken from the density eq.(6), we propose to estimate parameters
$\alpha$ and $A$ by maximizing the log-likelihood eq.(7) in the case $k = n/m$. Parameters
$\beta$, $B$ and the left endpoint $b$ are estimated in the same way using the set of
samples $\{-X_1, \ldots, -X_n\}$.
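The grouping step and the log-likelihood of eq.(7) translate directly into code. The sketch below is an illustration under stated assumptions rather than the authors' procedure: the names `group_maxima` and `log_likelihood` are hypothetical, groups are taken as consecutive blocks of equal size $k = n/m$ (any leftover samples are ignored), and the function returns $-\infty$ when $a$ does not exceed every group maximum, since the density is zero there.

```python
import math


def group_maxima(samples, m):
    """Split the samples into m consecutive groups of size k = len(samples) // m
    and return the maximum of each group."""
    k = len(samples) // m
    return [max(samples[i * k:(i + 1) * k]) for i in range(m)]


def log_likelihood(alpha, A, a, maxima, k):
    """Log-likelihood of eq.(7) for group maxima x_1, ..., x_m:
    sum_i { log(A k) + (alpha - 1) log(a - x_i) - (A k / alpha) (a - x_i)^alpha }."""
    total = 0.0
    for x in maxima:
        gap = a - x
        if gap <= 0.0:
            return float("-inf")  # a must lie strictly above every observed maximum
        total += (
            math.log(A * k)
            + (alpha - 1.0) * math.log(gap)
            - (A * k / alpha) * gap ** alpha
        )
    return total


maxima = group_maxima([1.0, 5.0, 2.0, 4.0, 3.0, 0.0], m=3)
value = log_likelihood(alpha=1.0, A=1.0, a=1.0, maxima=[0.5], k=2)
```

A function of this form can be handed to any generic numerical optimizer; the choice of optimizer is left open here because the text only specifies "an iterative algorithm".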

The log-likelihood eq.(7) can be maximized by an iterative algorithm. However,
the maximization with respect to the three variables $\alpha$, $A$ and $a$ sometimes
causes the log-likelihood to diverge. For this reason, based on the result of the
previous section, we propose a more stable method with a reduced number of
variables, in which only the two variables $\alpha$ and $A$ need to be estimated.

In the previous section, the endpoint $a$ was estimated by the estimator

\[ \hat{a} = \max_{1 \le i \le n} X_i + \frac{c_a}{n^{1/\alpha}} \]

with given $\alpha$ and $A$. From Corollary 1 and Corollary 2 in the previous section, $c_a^*$
that minimizes the average loss is represented by parameters $\alpha$ and $A$. Therefore,
the log-likelihood function eq.(7) becomes a function of two parameters $\alpha$ and
$A$ by replacing the parameter $a$ with that of the form

\[ a = \max_{1 \le i \le n} X_i + \frac{c_a^*}{n^{1/\alpha}}. \qquad (8) \]



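Denoting by $L(\alpha, A)$ this log-likelihood function with two variables, we propose
to estimate parameters $\alpha$ and $A$ by maximizing $L(\alpha, A)$ with an iterative
algorithm. The gradient of $L(\alpha, A)$ used in the iterative algorithm is given by

\[ \frac{\partial L}{\partial \alpha} = \sum_{i=1}^{m} \left[ \left\{1 - \frac{Ak}{\alpha}(a - x_i)^{\alpha}\right\}\log(a - x_i) + \frac{Ak}{\alpha^2}(a - x_i)^{\alpha} + \frac{1}{a - x_i}\left\{\alpha - 1 - Ak(a - x_i)^{\alpha}\right\}\frac{\partial a}{\partial \alpha} \right], \]

\[ \frac{\partial L}{\partial A} = \sum_{i=1}^{m} \left[ \frac{1}{A} - \frac{k}{\alpha}(a - x_i)^{\alpha} + \frac{1}{a - x_i}\left\{\alpha - 1 - Ak(a - x_i)^{\alpha}\right\}\frac{\partial a}{\partial A} \right], \]

where

\[ \frac{\partial a}{\partial \alpha} = \frac{1}{\alpha^2}\left\{1 - \log\left(\frac{\alpha \log 2}{An}\right)\right\}\left(\frac{\alpha \log 2}{An}\right)^{1/\alpha}, \qquad \frac{\partial a}{\partial A} = -\frac{1}{\alpha A}\left(\frac{\alpha \log 2}{An}\right)^{1/\alpha}, \]

using $c_a^*$ of Corollary 1 in eq.(8), or

\[ \frac{\partial a}{\partial \alpha} = \frac{1}{\alpha^2}\left\{\left(1 - \log\frac{\alpha}{An}\right)\Gamma\left(1 + \frac{1}{\alpha}\right) - \Gamma'\left(1 + \frac{1}{\alpha}\right)\right\}\left(\frac{\alpha}{An}\right)^{1/\alpha}, \qquad \frac{\partial a}{\partial A} = -\frac{1}{\alpha A}\left(\frac{\alpha}{An}\right)^{1/\alpha}\Gamma\left(1 + \frac{1}{\alpha}\right), \]

using $c_a^*$ of Corollary 2.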



5 Experiments 

In this section, we demonstrate the efficiency of the proposed method by exper- 
imental results. 



5.1 Estimation of the 1-Dimensional Data Region 



First, taking $n$ samples from the uniform distribution

\[ f(x) = \begin{cases} 1 & (0 \le x \le 1), \\ 0 & (\text{otherwise}), \end{cases} \]

we estimated the right endpoint $a = 1.0$. This is the case when the parameters
$\alpha = A = 1.0$. If $\alpha$ and $A$ are known, the estimator $\hat{a}$ that minimizes
$E_{X^n}[|\hat{a} - a|]$ is given by

\[ \hat{a} = \max_{1 \le i \le n} X_i + \frac{\log 2}{n} \]

from Corollary 1 in Section 3. We set $n = 10000$ and estimated the endpoint
$a = 1.0$ using the above estimator $\hat{a}$.

Table 1. The average loss (when $\alpha = A = 1.0$)

                      Theoretical   $\alpha, A$: known   $\alpha, A$: unknown
$n(a - \hat{a})$          0.306         0.321 (0.998)        0.322 (1.060)
$n|a - \hat{a}|$          0.693         0.703 (0.777)        0.760 (0.806)

Table 2. The average loss (when $\alpha = A = 2.0$)

                           Theoretical   $\alpha, A$: known   $\alpha, A$: unknown
$\sqrt{n}(a - \hat{a})$        0.053         0.061 (0.463)        0.026 (0.715)
$\sqrt{n}|a - \hat{a}|$        0.371         0.373 (0.282)        0.572 (0.430)

Then, on the assumption that a and A were unknown, we estimated the 
endpoint a, estimating a, A simultaneously by using the method described in 
Section 4. In this case, we divided 10000 samples into 100 groups and obtained 
100 samples from the distribution of maximum value eq.(6). 

We simulated this estimation 2000 times respectively and calculated the av- 
erages of n(a — d) and n|a— a|. The results are presented in Table 1. The standard 
deviations are in parentheses. 
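The known-parameter case of this first experiment can be reproduced with a short simulation. The following is a minimal sketch, not the authors' code: it assumes the estimator $\hat a = \max_i X_i + \log 2 / n$ for the uniform density on $(0, 1)$, and, as a shortcut that is our own assumption, it samples the maximum of $n$ uniform draws directly as $U^{1/n}$ rather than drawing all $n$ values. The function name and the seed are hypothetical.

```python
import math
import random

def simulate_known_parameter_estimator(n=10000, trials=2000, seed=0):
    """Average n*(a - a_hat) and n*|a - a_hat| for Uniform(0, 1), where a = 1."""
    rng = random.Random(seed)
    a = 1.0
    signed_sum = 0.0
    abs_sum = 0.0
    for _ in range(trials):
        # The maximum of n i.i.d. Uniform(0, 1) draws has CDF x**n, so it can
        # be sampled directly as U**(1/n) instead of drawing all n samples.
        sample_max = rng.random() ** (1.0 / n)
        a_hat = sample_max + math.log(2.0) / n  # estimator for alpha = A = 1
        signed_sum += n * (a - a_hat)
        abs_sum += n * abs(a - a_hat)
    return signed_sum / trials, abs_sum / trials

mean_signed, mean_abs = simulate_known_parameter_estimator()
print(f"n(a - a_hat): {mean_signed:.3f}   n|a - a_hat|: {mean_abs:.3f}")
```

Up to Monte Carlo noise, the two averages should land near $1 - \log 2$ and $\log 2$, the values listed as theoretical in Table 1.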

The other experiment was conducted in the case where $n$ samples were taken from the density

$$f(x) = \begin{cases} 2 - 2x & (0 < x < 1), \\ 0 & (\text{otherwise}). \end{cases}$$

In this case the endpoint $a = 1.0$ and the parameters $\alpha = A = 2.0$. If $\alpha$ and $A$ are known, the estimator $\hat a$ that minimizes $E[|\hat a - a|]$ is given by

$$\hat a = \max_{1 \le i \le n} X_i + \sqrt{\frac{\log 2}{n}}.$$

We estimated the endpoint $a$ again using the above $\hat a$ and the $\hat a$ that was constructed by estimating $\alpha$ and $A$ simultaneously. The results of this case are given in Table 2.

We see that in the case where $\alpha$ and $A$ are known, the results strongly support our theoretical results. If we estimate the parameters $\alpha$ and $A$ simultaneously, the average loss becomes larger than in the case where $\alpha$ and $A$ are known. However, in the case when $\alpha = A = 1.0$, the results change little by estimating $\alpha$ and $A$ simultaneously.

5.2 Estimation of the 2-Dimensional Data Region

A multi-dimensional data region can also be estimated by using the proposed 
method. For given multi-dimensional data generated in some region, we consider 
a mapping from multi-dimensional space to $\mathbb{R}$. The proposed method can be
applied to 1-dimensional data obtained by applying this mapping to the given 
data. Using some mappings, some sets of 1-dimensional data are obtained. Then 
a multi-dimensional data region is estimated by applying the proposed method 
to each 1-dimensional data set. 

In order to investigate the effectiveness of the proposed method when it is 
applied to estimate a multi-dimensional data region, we conducted an experiment 
where a 2-dimensional data region was estimated. 

Given a set of $d$-dimensional data $\{x_1, \dots, x_n\}$, we considered the following two procedures, (P1) and (P2), in which we made $k$ mappings $g_1(x), \dots, g_k(x)$ from a datum $x \in \mathbb{R}^d$.

(P1): Generating vectors $a_1, \dots, a_k \in \mathbb{R}^d$ randomly, define $g_i(x)$ by the inner product of the datum $x$ and the vector $a_i$, that is, $g_i(x) = \langle a_i, x \rangle$.

(P2): Selecting $k$ data $x_{i_1}, \dots, x_{i_k}$ from $\{x_1, \dots, x_n\}$, define $g_j(x)$ by the distance between $x$ and $x_{i_j}$ in the kernel-induced feature space. That is, $g_j(x) = \|\phi(x) - \phi(x_{i_j})\| = \sqrt{K(x, x) - 2K(x, x_{i_j}) + K(x_{i_j}, x_{i_j})}$, where $\phi(x)$ is a mapping from the input space to the feature space, $\|\cdot\|$ is the norm in the feature space and $K(x, y) = \langle \phi(x), \phi(y) \rangle$ is a kernel function.
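The two procedures can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions, not the authors' implementation: it fixes $d = 2$, uses the degree-2 polynomial kernel $K(x, y) = (1 + \langle x, y \rangle)^2$, and picks the (P2) reference points arbitrarily as the first data points; all function names are hypothetical.

```python
import math
import random

def poly_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (1 + <x, y>)**2."""
    return (1.0 + sum(xi * yi for xi, yi in zip(x, y))) ** 2

def make_projection_maps(k, rng):
    """(P1): k maps g_i(x) = <a_i, x>, with a_i uniform on the unit circle (d = 2)."""
    maps = []
    for _ in range(k):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        a = (math.cos(theta), math.sin(theta))
        maps.append(lambda x, a=a: a[0] * x[0] + a[1] * x[1])
    return maps

def make_kernel_distance_maps(centers, kernel=poly_kernel):
    """(P2): maps g_j(x) = ||phi(x) - phi(c_j)||, computed from kernel values only."""
    def distance_to(c):
        def g(x):
            sq = kernel(x, x) - 2.0 * kernel(x, c) + kernel(c, c)
            return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors
        return g
    return [distance_to(c) for c in centers]

rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(200)]
p1_maps = make_projection_maps(40, rng)
p2_maps = make_kernel_distance_maps(data[:40])  # reference points chosen arbitrarily
one_dim_sets = [[g(x) for x in data] for g in p1_maps + p2_maps]
```

Each entry of `one_dim_sets` is one 1-dimensional data set to which an endpoint estimator could then be applied.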

We fixed the number of samples n = 200 and generated 2-dimensional data 
uniformly in the region D* defined by 

D* = {{x\ X 2 )’^ G x\ + 2x\x\ + X 2 ~ 2\/2{xiX2 -I- xi -I- X 2 ) + 4 < 0}. (9) 
Fig. 3 displays an example of this artificial data set. 




Fig. 3. An example of the 2-dimensional data set

Table 3. The error rate in the estimation of the 2-dimensional data region

         the average error rate (T_ 0+) (%)
         c_a = c_b = 0        c_a, c_b: optimal
(P1)     3.54 (3.54 0.00)     2.40 (1.59 0.81)
(P2)     2.21 (2.20 0.01)     1.36 (1.06 0.30)



We made 40 sets of 1-dimensional data using each of the procedures described above. More specifically, in the first procedure (P1), vectors $a_1, \dots, a_{40}$ were generated uniformly on the unit circle, and in (P2), we used the polynomial kernel with degree 2, and picked out the 40 data $x_{i_1}, \dots, x_{i_{40}}$ that are closest to the average vector $\frac{1}{n}\sum_{i=1}^{n} \phi(x_i)$ in the feature space^1.

To each 1-dimensional data set $\{g_i(x_1), \dots, g_i(x_n)\}$, we applied the proposed method and estimated the right endpoint $a$. The coefficient $c_a$ was set to be the optimal $c^*$ described in Corollary 1. The left endpoint $b$ was estimated by applying the proposed method to the data set $\{-g_i(x_1), \dots, -g_i(x_n)\}$ in the same way. For all data sets, the 200 samples were divided into 40 groups and we obtained 40 samples from the distribution of the maximum value eq.(6) to estimate the parameters $\alpha$ and $A$. Then we calculated the error rate on 10000 test samples. The test samples were generated on lattice points in a region that includes the true data region $D^*$. By doing so, we approximately evaluated the difference between the estimated region and the true region $D^*$ by its area.

The error rate is calculated by the sum of the two types of the error. One 
is the rate of the test data that are in the estimated region but out of the true 
region D*. We denote this type of the error rate by 0+. And the other is the 
rate of the test data that are out of the estimated region but in D*. We denote 
this by T_. 

In order to investigate the effectiveness of the proposed method where the coefficients $c_a$ and $c_b$ were set to be optimal, we also calculated the error rate in the case when the coefficients $c_a = c_b = 0$, that is, when the maximum and minimum values were used directly to estimate the endpoints of the distribution of each 1-dimensional data set. Table 3 shows the average of the error rate over 5 data sets in the two cases when the procedures (P1) and (P2) were used in extracting 1-dimensional data sets. In parentheses, T_ (left) and 0+ (right) are given.

We see that by setting $c_a$ and $c_b$ to be optimal, though 0+ increases, the accuracy of estimation was improved with respect to the total error rate.



^1 However, the explicit expression of the mapping $\phi(x)$ is not required since the mapping is implicitly carried out and the norm in the feature space is computed by using only the corresponding kernel function $K(x, y) = (1 + \langle x, y \rangle)^2$.

6 Discussion and Conclusions



In the previous sections, we evaluated $|\hat a - a|^k$ as loss functions and derived the optimal estimators for the loss functions. These loss functions are symmetric with respect to $\hat a$ being greater (overestimation) or smaller (underestimation) than $a$. In some situations, it is more suitable to use loss functions that are not symmetric in this respect. Let us consider the case where we use such asymmetric loss functions. Let $l_1$ and $l_2$ be positive constants and let the function $L(x)$ be

$$L(x) = \begin{cases} -l_1 x & (x < 0), \\ l_2 x & (x \ge 0). \end{cases}$$

Redefining the average loss by $E_{X^n}[L(\hat a - a)]$, we also obtain the following theorem in the same way.



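Theorem 5. Suppose $X_1, \dots, X_n$ are independently and identically taken from density $f(x)$, and if the density function $f(x)$ satisfies the conditions (i),(ii) in Section 3, then

$$E_{X^n}\left[n^{1/\alpha} L(\hat a - a)\right] = l_2 \left[ c_a - \frac{1}{\alpha}\left(\frac{\alpha}{A}\right)^{1/\alpha} \left\{ \left(1 + \frac{l_1}{l_2}\right) \gamma\left(\frac{1}{\alpha}, \frac{A}{\alpha} c_a^{\alpha}\right) - \frac{l_1}{l_2}\,\Gamma\left(\frac{1}{\alpha}\right) \right\} \right] + o(1),$$

where $\gamma(s, z) = \int_0^z t^{s-1} e^{-t}\,dt$ is the lower incomplete gamma function. And denoting the optimal coefficient $c_a$ that minimizes this average loss by $c_a^*$,

$$c_a^* = \left\{ \frac{\alpha}{A} \log\left(1 + \frac{l_1}{l_2}\right) \right\}^{1/\alpha}.$$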
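The role of the asymmetric loss can be checked numerically. The sketch below is an illustration under assumptions of our own, not the authors' procedure: it takes the limiting law of the scaled maximum to have survival function $P(Y > y) = \exp(-(A/\alpha) y^{\alpha})$ (as in the proof of Theorem 1), treats the loss of a coefficient $c$ as $L(c - Y)$, and uses the first-order condition $P(Y > c) = l_2 / (l_1 + l_2)$ to obtain a candidate optimum. All names are hypothetical.

```python
import math
import random

def optimal_coefficient(alpha, A, l1, l2):
    """Solve exp(-(A/alpha) * c**alpha) = l2 / (l1 + l2) for c."""
    return ((alpha / A) * math.log(1.0 + l1 / l2)) ** (1.0 / alpha)

def sample_limit_variable(alpha, A, rng):
    """Inverse-transform sample of Y with P(Y > y) = exp(-(A/alpha) * y**alpha)."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
    return ((alpha / A) * -math.log(u)) ** (1.0 / alpha)

def average_loss(c, ys, l1, l2):
    """Empirical mean of L(c - y): c < y (underestimation) costs l1, c > y costs l2."""
    return sum(l1 * (y - c) if y > c else l2 * (c - y) for y in ys) / len(ys)

rng = random.Random(0)
alpha, A, l1, l2 = 2.0, 2.0, 3.0, 1.0
ys = [sample_limit_variable(alpha, A, rng) for _ in range(20000)]
c_star = optimal_coefficient(alpha, A, l1, l2)
for c in (c_star - 0.2, c_star, c_star + 0.2):
    print(f"c = {c:.3f}  average loss = {average_loss(c, ys, l1, l2):.4f}")
```

With $l_1 = l_2$ the candidate reduces to $(\alpha \log 2 / A)^{1/\alpha}$, which equals $\log 2$ for $\alpha = A = 1$, the offset used in the uniform experiment of Section 5.1.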





In this paper, we proposed a method to estimate the data region using the 
maximum and minimum values in the samples. We calculated the asymptotic 
distributions of these values and derived the optimal estimators for given loss 
functions. By extracting 1-dimensional data, the proposed method has abundant 
practical applications. It can also be applied effectively to estimate a multi-dimensional data region by using some mappings to $\mathbb{R}$. We demonstrated this
by the experimental study on 2-dimensional data in the latter half of Section 5. 
The shape of the estimated region and the accuracy of the estimation depend 
on the selection and the number of mappings to extract 1-dimensional data sets 
from the multi-dimensional data. To maximize the effectiveness of the proposed 
method, questions like the selection of those mappings and the optimization of 
the number of them have to be addressed. 

Appendix

Proof of Theorem 1 

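Proof. From assumptions (i) and (ii), in a neighborhood of $a$ we have

$$f(x) = A(a - x)^{\alpha - 1} + o((a - x)^{\alpha - 1})$$

and

$$1 - F(x) = \frac{A}{\alpha}(a - x)^{\alpha} + o((a - x)^{\alpha}).$$

Putting

$$u_n = \left(\frac{A}{\alpha} n\right)^{-1/\alpha} x + a,$$

for $x < 0$,

$$\Pr\left\{\left(\frac{A}{\alpha} n\right)^{1/\alpha}(M_n - a) < x\right\} = F(u_n)^n = \{1 - (1 - F(u_n))\}^n = \left\{1 - \frac{(-x)^{\alpha}}{n} + o\left(\frac{1}{n}\right)\right\}^n \to \exp(-(-x)^{\alpha}) \quad (n \to \infty).$$

Since $1 - F(u_n) = 0$ for $x \ge 0$, we obtain $G_{\max}(x)$. By a similar argument, we also obtain $G_{\min}(x)$. □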



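Proof of Theorem 2

Proof. Putting

$$u_n = \left(\frac{A}{\alpha} n\right)^{-1/\alpha} s + a, \qquad v_n = \left(\frac{B}{\beta} n\right)^{-1/\beta} t + b,$$

we have

$$\Pr\left\{\left(\frac{A}{\alpha} n\right)^{1/\alpha}\left(\max_{1 \le i \le n} X_i - a\right) < s,\ \left(\frac{B}{\beta} n\right)^{1/\beta}\left(\min_{1 \le i \le n} X_i - b\right) < t\right\} = F(u_n)^n - \{F(u_n) - F(v_n)\}^n.$$

It follows from Theorem 1 that

$$F(u_n)^n \to \exp(-(-s)^{\alpha}) \quad (n \to \infty),$$

and that

$$\{F(u_n) - F(v_n)\}^n = \{1 - (1 - F(u_n)) - F(v_n)\}^n = \left\{1 - \frac{(-s)^{\alpha}}{n} - \frac{t^{\beta}}{n} + o\left(\frac{1}{n}\right)\right\}^n \to \exp(-(-s)^{\alpha} - t^{\beta}) \quad (n \to \infty).$$

Then we have

$$G_n(s, t) \to \exp(-(-s)^{\alpha})(1 - \exp(-t^{\beta})) \quad (n \to \infty). \qquad (10)$$

□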



Proof of Theorem 3 

Proof. We define 

t = a — n + Ca), 

— 1,1 , 
t = a — n “ (— t -I- Ca), 

then n^/“(a — a)^ < t is equivalent to 

max Xi <t 

l<i<n 

for any odd number k > 0, and — a|* < t is equivalent to 

t < max Xi <t 

l<i<n 



for any natural number k. 

Hence 

Hn{t) 

and 

Hn{t) = I 



\F(ir {t<cD, 
l-F(t)" it>cl), 

F(ir-F{tr {t<ci). 



Since, in a neighborhood of a, we have 



1 - F{x) 



— {a - x)" + o((a - 
a 
we obtain



F{tr = {l---(t^/'= + Ca)“+0(-)}” 

^ n a * 

^exp(--(ca + t^/'=)“) (n^oo) 



and 



Far = {i---M'/"+ca)“+o(-)}” 

^ n a * 

exp(- — (Ca - (n-)>oo). 

a 



Thus we have eq.(4) and eq.(5). 



□ 



Proof of Theorem 4 

Proof. If k is an odd number, 



Exr. 



r'=/“(a-a)'= 



c5- 



= c^- 



A 



A 



exp( {ca — (n — >■ oo) 



f —til 

{ca-(^i) } i^t) e dt. 



'T 



For natural number k 
Ex 






cj- 



exp( (ca-< /'^) )dt 

Jo ^ 

A 

+ / exp( (ca + (n — >■ oo) 

Jo ct 



/o 
k 



Ca ^ 



JO 



{ca-(^t)“} (^t)“ e dt 



>k-urAi-r-t. 

^A 



Expanding (ca— ^ completes the proof. 



□ 

Abstract. We consider the Maximum Entropy principle for non-ordered data in a non-probabilistic setting. The main goal of this paper is to deduce asymptotic relations for the frequencies of the energy levels in a non-ordered sequence $\omega^N = [\omega_1, \dots, \omega_N]$ from the assumption of maximality of the Kolmogorov complexity $K(\omega^N)$ given a constraint $\sum_{i=1}^{N} f(\omega_i) = NE$, where $E$ is a number and $f$ is a numerical function.

1 Introduction 

Jaynes’ Maximum Entropy Principle is the well-known method for inductive inference and probabilistic forecasting in the case when some a priori constraints for the mathematical expectation and other moments are given (Jaynes [3], Cover and Thomas [1], Section 11). Extremal relations between the cost of the information transmission and the capacity of the channel were considered in [11] (Chapter 3). This principle originated in statistical physics, for computing the numerical characteristics of ideal gases in the equilibrium state (Landau and Lifshits [8]).

Let $f(a)$ be a function taking numerical values at all letters of an alphabet $B = \{a_1, \dots, a_M\}$. For each collection of letters $\omega_1, \dots, \omega_N$ from the alphabet $B$ we consider the sum $\sum_{i=1}^{N} f(\omega_i)$. The value $f(\omega_i)$ can have various physical or economic meanings. It may describe the cost of the element $\omega_i$ of a message or a loss (fine) under the occurrence of the event $\omega_i$. In thermodynamics, $f(\omega_i)$ is the energy of a particle or volume element in the state $\omega_i$.

In contrast to [1], [3], [11], we consider non-ordered sequences, or bags; more precisely, we consider the Bose-Einstein model [8], [10], well known in statistical physics, for a system of $N$ indistinguishable particles of $n$ types. This type of data is typical for finance, where non-ordered and indistinguishable collections of items (like banknotes or stocks) are considered.

In this work, we do not assume the existence of any probabilistic mechanism generating elements of the collection $\omega_1, \dots, \omega_N$. Instead, we consider combinatorial models for describing possible collections of outcomes $\omega_1, \dots, \omega_N$ typical for statistical mechanics; more precisely, we assume that the collections of outcomes under consideration are “chaotic” or “generic” elements in some simple sets. We refer to such a collection as a microstate (for the precise definition see below). The notions of chaoticity and simplicity are introduced with the use of the algorithmic complexity (algorithmic entropy) introduced by Kolmogorov in [4].

The main goal of this paper is to deduce asymptotic relations for the frequencies of the energy levels in a microstate $\omega^N = [\omega_1, \dots, \omega_N]$ from the assumption of maximality of the Kolmogorov complexity $K(\omega^N)$ given the cost $\sum_{i=1}^{N} f(\omega_i)$ of the message (microstate) $\omega^N$.

2 Preliminaries 

We refer readers for details of the theory of Kolmogorov complexity and algo- 
rithmic randomness to [9]. In this section we briefly introduce some definitions 
used in the following. 

Kolmogorov complexity is defined for any constructive (finite) object. The set of all words in a finite alphabet is a typical example of a set of constructive objects. The definition of Kolmogorov complexity is based on the theory of algorithms. Algorithms define computable functions transforming constructive objects. Let $B(p, y)$ be an arbitrary computable function of two arguments, where $p$ is a finite binary word, and $y$ is a word (possibly) in a different alphabet. We suppose also that the method of decoding is prefix: if $B(p, y)$ and $B(p', y)$ are defined then neither of the words $p$ and $p'$ is an extension of the other^1. The measure of complexity (with respect to $B$) of a constructive object $x$ given a constructive object $y$ is defined as

$$K_B(x|y) = \min\{l(p) \mid B(p, y) = x\},$$

where $l(p)$ is the length of the binary word $p$ (we set $\min \emptyset = \infty$). We consider the function $B(p, y)$ as a method of decoding of constructive objects, where $p$ is a code of an object under a condition $y$. An optimal method of decoding exists. A decoding method $B(p, y)$ is called optimal if for any other method of decoding $B'(p, y)$ the inequality $K_B(x|y) \le K_{B'}(x|y) + O(1)$ holds, where the constant $O(1)$ does not depend on $x$ and $y$ (but does depend on the function $B'$)^2. Any two optimal decoding methods determine measures of complexity that differ by at most a constant. We fix one such optimal decoding method, denote the corresponding measure of complexity by $K(x|y)$, and call it the (conditional) Kolmogorov complexity of $x$ with respect to $y$. The unconditional complexity of the word $x$ is defined as $K(x) = K(x|\Lambda)$, where $\Lambda$ is the empty sequence.

^1 Kolmogorov’s original definition did not use the prefix property.

^2 The expression $f(x_1, \dots, x_n) \le g(x_1, \dots, x_n) + O(1)$ means that there exists a constant $c$ such that the inequality $f(x_1, \dots, x_n) \le g(x_1, \dots, x_n) + c$ holds for all $x_1, \dots, x_n$. The expression $f(x_1, \dots, x_n) = g(x_1, \dots, x_n) + O(1)$ means that $f(x_1, \dots, x_n) \le g(x_1, \dots, x_n) + O(1)$ and $g(x_1, \dots, x_n) \le f(x_1, \dots, x_n) + O(1)$. In the following the expression $F(N) = O(G(N))$ means that a constant $c$ exists (not depending on $N$) such that $|F(N)| \le cG(N)$ for all $N$.
We will use the following relations that hold for the prefix complexity of positive integer numbers $n$ (see [9]). For any $\epsilon > 0$

$$K(n) \le \log n + (1 + \epsilon) \log\log n + O(1). \qquad (1)$$

Besides,

$$K(n) \ge \log n + \log\log n \qquad (2)$$

for infinitely many $n$.

Given the list of all elements of a finite set $D$ we can encode any specific element of it by a binary sequence of length^3 $\lceil \log |D| \rceil$. This encoding is prefix-free. Then we have

$$K(x|D) \le \log |D| + O(1).$$

Moreover, for any $c > 0$, the number of all $x \in D$ for which

$$K(x|D) < \log |D| - c \qquad (3)$$

does not exceed $2^{-c}|D|$, i.e. the majority of elements of the set $D$ have conditional Kolmogorov complexity close to its maximal value. Kolmogorov [6], [7] defined the notion of the deficiency of algorithmic randomness of an element $x$ of a finite set $D$ of constructive objects

$$d(x|D) = \log |D| - K(x|D).$$

Let $\mathrm{Rand}_m(D) = \{x \in D \mid d(x|D) < m\}$ be the set of all $m$-random (chaotic) elements of $D$. It holds $|\mathrm{Rand}_m(D)| \ge (1 - 2^{-m})|D|$.

Let $X$ be a finite set of constructive objects and $\Xi(\alpha)$ be a computable function from $X$ to some set of constructive objects. We refer to $\Xi$ as a sufficient statistic (similar ideas can be found in [2], [9]). We will identify the value $\Xi(\alpha)$ of the sufficient statistic and its whole preimage $\Xi[\alpha] = \Xi^{-1}(\Xi(\alpha))$. So, we identify the sufficient statistic $\Xi$ and the corresponding partition of the set $X$. We refer to the set $\Xi[\alpha]$ as the macrostate corresponding to a microstate $\alpha$.

We will use the following important relation which is valid for prefix Kolmogorov complexity:

$$K(x, y) = K(y) + K(x|y, K(y)) + O(1), \qquad (4)$$

where $x$ and $y$ are arbitrary constructive objects [9].

Since $K(\alpha, \Xi(\alpha)) = K(\alpha) + O(1)$, we obtain from (4) a natural representation of the complexity of a microstate through its conditional complexity with respect to its macrostate and the complexity of the macrostate itself:

$$K(\alpha) = K(\alpha|\Xi(\alpha), K(\Xi(\alpha))) + K(\Xi(\alpha)) + O(1). \qquad (5)$$

^3 In the following $\log r$ denotes the logarithm of $r$ to the base 2, $\lceil r \rceil$ is the minimal integer number $\ge r$, and $|D|$ is the cardinality of the set $D$.
The deficiency of randomness of a microstate $\alpha$ with respect to a sufficient statistic $\Xi(\alpha)$ (or to the corresponding partition) is defined as

$$d_{\Xi}(\alpha) = \log |\Xi[\alpha]| - K(\alpha|\Xi(\alpha), K(\Xi(\alpha))) + O(1). \qquad (6)$$

By definition, for any $\alpha \in \Xi$ it holds $K(\alpha|\Xi, K(\Xi)) \le \log |\Xi| + O(1)$, where $\Xi$ is an arbitrary element of the partition. We have $d_{\Xi}(\alpha) \ge -c$ for all $\alpha \in X$, where $c \ge 0$ is a constant. Besides, for any $m \ge 0$ the number of all $\alpha \in X$ such that $d_{\Xi}(\alpha) > m$ does not exceed $2^{-m}|X|$.

By (6) the following representation of the complexity of an element $\alpha \in X$ is valid:

$$K(\alpha) = \log |\Xi[\alpha]| + K(\Xi(\alpha)) - d_{\Xi}(\alpha) + O(1). \qquad (7)$$

Let $B = \{a_1, \dots, a_M\}$ be some finite alphabet, $f(a)$ be a function on $B$, and let $\epsilon_1, \dots, \epsilon_n$ be all its different values (energy levels); we suppose that these values are computable real numbers. We also suppose that $3 \le n \le M$. Define $G_j = \{a : f(a) = \epsilon_j\}$, $g_j = |G_j|$, $j = 1, \dots, n$. It holds $\sum_{j=1}^{n} g_j = M$.

Let us consider the set of all non-ordered sequences (bags or multi-sets) $\omega^N = [\omega_1, \dots, \omega_N]$, which can be obtained from the set $B^N$ by factorization with respect to the group of all permutations of the elements $\omega_1, \dots, \omega_N$. Any multi-set $\omega^N$ can be identified with the constructive object, the $M$-tuple

$$\langle (a_1, k_1), \dots, (a_M, k_M) \rangle, \qquad (8)$$

where each letter $a_i$ is supplied with its multiplicity $k_i = |\{j : \omega_j = a_i\}|$, $i = 1, \dots, M$. The size of this multi-set is equal to the sum of all multiplicities, $N = \sum_{i=1}^{M} k_i$.

Let 77]^ a set (a simplex) of all n-tuples of nonnegative integer numbers 
(fVi, . . . , Nn) such that X^r=i Ni = N. A sufficient statistics on with 

the range in 7T]^ is defined as follows. Put = {Ni, . . . , Nn), where 

N, = = \{j :l<j<N, f{cOj) = e,}| (9) 

for i = This means that the element of the corresponding 

partition of the set B^ consists of all non-ordered sequences = [wi, . . . ,ojn] 
satisfying (9). In other words, it consists of M-tuples < (m, 7i), . . . (aM,kM) > 
such that 

E % = 

dj^Gi 

for i = 1, . . . , n. By definition 
Therefore, we have 



gi + Ni-1 
9i 



9n + Afn — 1 
9n 






(10) 
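The size of a macrostate can be checked by brute force on small examples. The sketch below is illustrative only and rests on an assumption of ours: that each energy level $i$ contributes the standard stars-and-bars factor $\binom{g_i + N_i - 1}{N_i}$, the number of ways to choose multiplicities for $g_i$ letters summing to $N_i$. The function names are hypothetical.

```python
from itertools import product
from math import comb

def macrostate_size(g, counts):
    """Number of bags with counts[i] letters drawn from the g[i] letters of level i."""
    size = 1
    for gi, ni in zip(g, counts):
        size *= comb(gi + ni - 1, ni)  # stars and bars for one energy level
    return size

def macrostate_size_brute_force(g, counts):
    """Enumerate multiplicity vectors level by level; only feasible for tiny inputs."""
    total = 1
    for gi, ni in zip(g, counts):
        total *= sum(1 for ks in product(range(ni + 1), repeat=gi) if sum(ks) == ni)
    return total

g = [2, 3, 1]        # letters per energy level, so M = 6
counts = [3, 2, 4]   # N_1, N_2, N_3, so N = 9
print(macrostate_size(g, counts), macrostate_size_brute_force(g, counts))
```

The brute-force enumeration grows quickly with $g_i$ and $N_i$, so it serves only as a sanity check on the closed form.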
The number $\rho = N/M$ is called the density. Let $\rho_i = (g_i - 1)/M$, $i = 1, \dots, n$. We will consider asymptotic relations for $N \to \infty$, $M \to \infty$ such that $N/M \to \rho > 0$; the numbers $\rho$ and $\rho_1, \dots, \rho_n$ are supposed to be arbitrary positive constants.

Let u)^ € = Ni/N, i = 1,. . . ,n. By Stirling formula 

and by (6), (7) we obtain 

K{iv^) = NH{iy^,...,n„)- 

- E b + 0(1), (11) 

^ -T Qi i 

where the leading (linear in $N$) term of this representation

$$H(\nu_1, \dots, \nu_n) = \sum_{i=1}^{n} \left( (\nu_i + \rho_i \rho^{-1}) \log(\nu_i + \rho_i \rho^{-1}) - \nu_i \log \nu_i - \rho_i \rho^{-1} \log(\rho_i \rho^{-1}) \right) \qquad (12)$$

is called the Bose entropy of the frequency distribution $(\nu_1, \dots, \nu_n)$. Notice that (12) tends to Shannon entropy of this frequency distribution when $\rho \to 0$.
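That the Bose entropy is the leading term can be illustrated numerically by comparing $N H$ with the exact logarithm of a macrostate size. This is a sketch under assumptions of ours: the macrostate size is taken to be the stars-and-bars product $\prod_i \binom{g_i + N_i - 1}{N_i}$, and $\rho_i \rho^{-1}$ is evaluated as $(g_i - 1)/N$; the two-level example and the function names are hypothetical.

```python
from math import comb, log2

def bose_entropy(nu, r):
    """Bose entropy (12) in bits, with r[i] standing for rho_i / rho."""
    total = 0.0
    for v, ri in zip(nu, r):
        total += (v + ri) * log2(v + ri) - ri * log2(ri)
        if v > 0.0:
            total -= v * log2(v)
    return total

def log2_macrostate_size(g, counts):
    """Exact log2 of the stars-and-bars product over energy levels."""
    return sum(log2(comb(gi + ni - 1, ni)) for gi, ni in zip(g, counts))

def per_particle_gap(scale):
    """|log2(size) / N - H| for a two-level system grown proportionally with scale."""
    g = [2 * scale + 1, 3 * scale + 1]
    counts = [4 * scale, 6 * scale]
    n_total = sum(counts)
    nu = [c / n_total for c in counts]
    r = [(gi - 1) / n_total for gi in g]  # rho_i / rho = (g_i - 1) / N
    return abs(log2_macrostate_size(g, counts) / n_total - bose_entropy(nu, r))

for scale in (1, 10, 100, 1000):
    print(scale, per_particle_gap(scale))
```

The gap per particle shrinks as the system grows, which is what one expects if the remaining terms are of lower order than $N$.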

The definition of the Bose entropy on the basis of Kolmogorov complexity 
gives us some justification of this notion. 

Let the mean cost $E$ of the transmission of a letter be given (we suppose that $E$ is a computable real number). Denote by $\mathcal{L}_N$ the set of all microstates $\omega^N$ of size $N$ satisfying

$$\left| \sum_{i=1}^{n} \epsilon_i N_i(\omega^N) - EN \right| \le C, \qquad (13)$$

where $C$ is a constant. For $C$ sufficiently large the number of all $n$-tuples $(N_1, \dots, N_n) \in \Pi_N^n$ satisfying (13) is of order $O(N^{n-2})$.

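The maximum $H_{\max}$ of the Bose entropy (12) given the constraints

$$\sum_{i=1}^{n} \nu_i \epsilon_i = E, \qquad \sum_{i=1}^{n} \nu_i = 1 \qquad (14)$$

is computed using the method of Lagrange multipliers and is reached at

$$\tilde\nu_i = \frac{\rho_i}{\rho\,(2^{\lambda \epsilon_i + \mu} - 1)}, \qquad (15)$$

where $\lambda$ and $\mu$ are defined by equations (14) (see also [8]).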
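The maximizer can be found numerically by solving the two constraints for the multipliers. The sketch below is one possible approach, not the authors' method: it assumes frequencies of the form $\nu_i = r_i / (2^{\lambda \epsilon_i + \mu} - 1)$ with $r_i = \rho_i / \rho$, solves the normalization constraint for $\mu$ by bisection at each $\lambda$, and then bisects on $\lambda$ to match the mean energy. It also assumes energies of moderate size so that the exponentials do not overflow. All names and the example values are hypothetical.

```python
import math

LN2 = math.log(2.0)

def bose_entropy(nu, r):
    """Bose entropy in bits for frequencies nu, with r[i] standing for rho_i / rho."""
    return sum((v + ri) * math.log2(v + ri) - v * math.log2(v) - ri * math.log2(ri)
               for v, ri in zip(nu, r))

def bose_frequencies(lam, mu, energies, r):
    """nu_i = r_i / (2**(lam * e_i + mu) - 1)."""
    return [ri / math.expm1((lam * e + mu) * LN2) for e, ri in zip(energies, r)]

def solve_mu(lam, energies, r):
    """For fixed lam, find mu such that the frequencies sum to 1 (the sum decreases in mu)."""
    lo = max(-lam * e for e in energies)  # every exponent lam * e_i + mu must stay positive
    step = 1.0
    while sum(bose_frequencies(lam, lo + step, energies, r)) > 1.0:
        step *= 2.0
    hi = lo + step
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:
            break
        if sum(bose_frequencies(lam, mid, energies, r)) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

def max_entropy_frequencies(energies, r, mean_energy, lam_bound=50.0):
    """Bisect on lam; the mean energy of the normalized frequencies decreases as lam grows."""
    def frequencies_at(lam):
        return bose_frequencies(lam, solve_mu(lam, energies, r), energies, r)
    lo, hi = -lam_bound, lam_bound
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        nu = frequencies_at(mid)
        if sum(v * e for v, e in zip(nu, energies)) > mean_energy:
            lo = mid
        else:
            hi = mid
    return frequencies_at(0.5 * (lo + hi))

energies = [1.0, 2.0, 3.0]
r = [0.5, 1.0, 1.5]
nu = max_entropy_frequencies(energies, r, mean_energy=1.8)
print([round(v, 4) for v in nu])
```

The target mean energy must lie strictly between the smallest and largest energy level for the bisection to have a solution within the assumed bounds.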

Outside the field of thermodynamics the Bose-Einstein model can be applied to financial problems. For example, we can consider $n$ groups $G_i$ of stocks, where $i = 1, \dots, n$. Each stock of the group $G_i$ is of one of $g_i$ types and brings the same income $\epsilon_i$. We would like to predict the market equilibrium distribution of $N$ stocks by values of income to receive the mean income $E$. This distribution is given by (15).

Jaynes’ [3] “entropy concentration theorem” says that for any $\epsilon > 0$ the portion of all (ordered) sequences or words $\omega_1 \dots \omega_N$ of the alphabet $B$ of the length $N$ satisfying (14) and such that $|\nu_i - \tilde\nu_i| > \epsilon$ for some $1 \le i \le n$ does not exceed $e^{-cN}$, where $(\tilde\nu_1, \dots, \tilde\nu_n)$ is the point at which the maximum of the Shannon entropy $H = \sum_{i=1}^{n} -\nu_i \log \nu_i$ given the constraints (14) is attained, and $c$ is a constant depending on $\epsilon$.

An analogue of Jaynes’ theorem holds for the set of all non-ordered sequences $\omega^N$.

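Theorem 1. For any $\epsilon > 0$ the portion of all microstates $\omega^N$ from $\mathcal{L}_N$ such that

$$\sum_{i=1}^{n} (\nu_i - \tilde\nu_i)^2 > \epsilon \qquad (16)$$

does not exceed $e^{-cN}$, where $\nu_i = N_i(\omega^N)/N$ and the numbers $\tilde\nu_i$ are defined by (15), $i = 1, \dots, n$. Here the constant $c$ depends on $\epsilon$.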

Proof. Let (iVi, . . . , A^„) G lJ)f satisfies (13). Put Shi = Vi — hi. It is easy to 
calculate the variation of the Bose entropy inside the layer at the point of 
its maximum 



5H{hi, ...,hn) = H{vi, ...,Vn)- H{hi,. ..,?■„) = 

{ShiY 



= -E 



^ h,{l + phi/pi) 



o{{5hi)^). 



(17) 



Let us suppose that (16) holds for all sufficiently large N . Then by (17) we 
have 

H ^ Hmax ce, 

where FI is the Bose entropy (12) of the frequency distribution Vi of the mi- 
crostate uj^ , i = l,...,n, and c is a positive constant. Analogously to the 
Bernoulli case it is easy to see that the number of all microstates of size N with 
the entropy FI is equal to The total number of all macrostates 

satisfying (13) is of order 0(A^"“^). Hence, the number of all microstates of size 
N with entropy FI < Hmax — ce decreases exponentially. □ 



3 Asymptotic Relations for Frequencies of Energy Levels 



Jaynes’ concentration theorem can be generalized to an asymptotic form.

Theorem 2. There exists a sequence of sets $A_N \subseteq \mathcal{L}_N$, $N = 1, 2, \dots$, such that

$$\lim_{N \to \infty} (|A_N| / |\mathcal{L}_N|) = 1$$

and such that for any sequence of microstates $\omega^N \in A_N$ the following asymptotic relation holds

$$\lim_{N \to \infty} \sum_{i=1}^{n} (\nu_i - \tilde\nu_i)^2 = 0, \qquad (18)$$

where $\nu_i = N_i(\omega^N)/N$ and the numbers $\tilde\nu_i$ are defined by (15) for $i = 1, \dots, n$.

A pointwise version of this theorem can be formulated in terms of Kolmogorov complexity. Let $\sigma(N)$ be a nondecreasing numerical function such that $\sigma(N) = o(N)$, and $\omega^N$, $N = 1, 2, \dots$, be a sequence of microstates such that $\omega^N \in \mathcal{L}_N$ and

$$K(\omega^N) \ge N H_{\max} - \sigma(N)$$

for all sufficiently large $N$. Then the asymptotic relation (18) holds.

Proof To prove the first part of the theorem define .4 at be equal to the set of all 
uj^ such that jr'j — hi] < for alH = 1, . . . iV. 

Let us consider the pointwise version of the theorem. By (11) and (17) we 
have 



K(w"^) = NH^acc - 



2 = 1 



— — — r — -n log N — 

Nvi{l + pvi/pi) 2 



-ds{uj^^) + K(iVi, . . . , fV„|iV, K(iV)) + K{N) + o{N{5i),Y). 



(19) 



By (19) and by condition this theorem we obtain 



- hf = 0((logiV)/iV) +0(1) = 0(1). 

i=l 

We also take into account that nearby the maximum point (15) the second 
member in (11) can be approximated by ^nlogiV+0(l). The theorem is proved. 
□ 

The following theorem is the starting point for several subsequent theorems that give tighter asymptotic bounds for the frequencies. These bounds are presented in the “worst” case.

Theorem 3. It holds 



$$\max_{\omega^N} K(\omega^N) = N H_{\max} - \log N + K(N) + O(1). \qquad (20)$$

The proof of this theorem is given in Section 4. 

The bound (20) is noncomputable. Two computable lower bounds, the first 
one holds for “most” N, the second one, trivial, holds for all N, are presented 
in Corollary 1; they also will be used in the definitions (22) and (24) below. 

We will consider limits by the base 

BY = {N : N > L, K(iV) > log iV - m}. 



where m, L = 0, 1, . . .. It is easy to see that S£Yi — 



lim lim inf 

m— >-oo N—^oo 



\BYn{L,...,L + N-l}\ 
N 



= 1 



for all m and L. 




228 V. Maslov and V. V’yugin 

Indeed, taking into account that the number of all programs of length < 
log N' — m, where N' < L + N — 1, does not exceed + fV — 1), we obtain 



\Bfn{L,...,L + N-l}\ 
N 



> 1 _ 2-™+i 



for all sufficiently large N. 

Corollary 1. For any e > 0, m > 0 and N G S™ 

N Hmax - m - 0{l) < max < N Fl^ax + 

+(l + e)loglogA^ + 0(l). (21) 

The upper bound in (21) is valid for all N. A trivial lower estimate which is 
valid for all N is 



max K(w^) > IV - log IV - 0(1). 

The bound (21) can be obtained by applying (1) to (20). 

Let cr be a nonnegative number. Let us define the set of all a-random mi- 
crostates in the layer (13) 

Rand(S, cr, N) = G K(u;^) > NH^ax ~ a}. (22) 

By Corollary 1 for all m and N G Bdf the set Rand(if , a, N) is nonempty when 
cr > m + 0(l) and contains all microstates co^ G at which the maximum 
of the complexity given constrains (14) is attained. In the following we suppose 
that cr = (t{N) = o(loglog A^). 

Theorem 4. For any m and for any nondecreasing numerical function cr(iV), 
such that cr(A^) > m + 0(1) for all N and a{N) = o(loglogiV), the following 
asymptotic relation holds 



limsup sup ^ = 1, (23) 

B'S \ ^/ N Vi{l + pVi / Pi)\og\og N j 

where the numbers Ni = Ni{uj^) are defined by (9), and the values Vi are defined 
by (15) for i = 1, . . . ,n. 

The proof of this theorem is given in Section 4. 

Let us consider the second computable lower estimate for the maximum of the complexity given the constraints (14). Let us define

Rand'(A;, ct, N) = G K(w^) > NH^ax -logN-a}. (24) 



The following theorem holds. 
Theorem 5. For any nonnegative number $\sigma$

limsup sup Vf , N,-Nh \ 

N^oo w«GRand'(B,<T,Ar) + pi>i/pi)K{N) j 

where the notation is the same as in Theorem 4.

A sketch of the proof of this theorem is given in Section 4.

The following corollary asserts that the maximum of the complexity is at- 
tained on a “random” microstate with respect to a “random” macrostate. For 
any m > 0 and JV define 

Max(if, m, N) = {uj^ : K{uj^) > max^ K(a^) — m}. 



Corollary 2. (from Theorem 3). For any m > 0 and G Max{E,m, N) 
relations dsito^) = 0(1) and K(7Vi, . . . , Nn\N, K(iV)) = |(n — 2) log N + 0(1) 
hold for all N, where S = S{uj^). 

Also, the following relation holds 



n 



lim sup sup 

N—>-oo uj^ ^Max{E,m,N) 



E 



N, - Nm 



y/Nhi{l + pVi/pi) 



= 1 , 



where we use the same notation as in Theorem 4.
The proof of this corollary is given in Section 4.

4 Appendix: Proofs



Proof of Theorem 4. Let be a sequence of microstates such that G 
for all N and 

K(w^) > NHmax - o(logloglV). 

Denote S — S{co^). Let e > 0 be a sufficiently small number. By (1) 

K{N) < log fV + (1+ ^e) log log TV + 0(1). 

By this inequality and by (19) we obtain 




)E 



{Nj - Nhf 

Nvi{l + ph/Pi) 



<K(iVi,...,A^„|7V,K(iV))- 



1 

2 



(n 



2) log TV + (1 + -e) loglogfV + oiloglogN). 



(25) 



Suppose that for some positive integer number L the following inequality holds 



(A — ^ 

^ NPi{l + pvi/Pi) ~ 



^e) + o(loglogiV). 



(26) 



Since each number Vi, i = 1, . . . ,n, is computable, using relation (26) we can 
effectively find the corresponding interval of integer numbers with center in NPi 
containing the number A . The program for A is bounded by the length of this 
interval and by the number i < n. So, we have 

K{N,\L,N)< ^logiV+ilogL + o(loglog7V). (27) 

Hence, taking into account constrains (14) on {Ni, . . . ,Nn), we obtain the in- 
equality 



K{Ni,...,N„\L,N)-^{n-2)logN < 

< ^ (n - 2) log L + o(log log N) . 

Using the standard inequality K{x\L) > K(a;) — K{L) — 0(1), we obtain 
K(lVi, . . . , A^„|iV,K(lV)) - i(n - 2) log iV < 

< ^nlogL + o(loglogfV). (28) 

Since (26) holds for L equal to the maximum of the integer part of the number 
K( , . . . , fV„ I iV, K(iV) ) - i (n - 2) log + ( 1 + ^ e) log log 
and the number one, we obtain by (28)

L < ^nlogL + (1 + ^e) log log N + o(loglog N). (29) 

Then we have 

K(lVi,...,iV„|lV,K(lV))- i(n-2)logiV + (l + ie)loglog7V) < 

< (1 + ^e) logloglV + o(logloglV). (30) 

By (25) we have for any e > 0 



Hence, 



lim sup 



sup 



E 



(iV, - Ni>,f 



uj^€Rand(E,a(N),N)~^ ( 1 + pf'i/Pi) log log iV 



< (1 + e) (32) 



for any e > 0. Since e is an arbitrary positive real number, we can replace 1 + e 
in (32) on 1. 

Now, let us prove that the lower bound in (23) is also valid. The simplex 
and the layer contain the center of the ellipsoid 



E 



(TV, - jVz>,)" 

Ni>i{l + pi>i/pi) log log N 



< 1 , 



(33) 



considered in the n-dimensional space 7^". Let e be a sufficiently small positive 
real number. The volume of the layer 



1-^E 



{Nj - Nhf 

iVj>i(l + pvi/ Pi) log log N 




(34) 



is equal to c(e)V, where V is the volume of the whole ellipsoid (33), and the 
constant c(e) depends on e, but does not depend on N. An analogous relation 
holds for the volume of intersection of the simplex and the layer with 
the ellipsoid (33) and for the volume of analogous intersection of these two 
sets with the layer (34), correspondingly. Since the volume of any ellipsoid is 
proportional to the product of lengths of its semi-axes, the total number of 
all n-tuples {Ni, . . . ,Nn) locating in the intersection of the layer (34) with the 
simplex and the layer is proportional to (A^loglogN) 2 ("~ 2 )^ Choose 
an n-tuple (iVi, . . . , N„) of the general position locating in this intersection. We 
have 

K(iVi , . . . , iV„ |7V, K(iV) ) = i (n - 2) log + log c(e) + o(log log N) 
for this n-tuple. Let ^ be the corresponding macrostate. Then for

any uj^ G E such that ds{ui^) = 0(1) the following inequality holds 

K(w^) > NHmax - (1 - ^e)loglogiV- inlogfV + 

+K(iVi, ...,Nr,\N, K{N)) + K{N) + 0(log log N) = 

= NHmax + N{N) - log TV - log log log log fV + 

+ logc(e) + o(loglogfV). (35) 

We have for this n-tuple (A^i, . . . , fV„) of general position 

y" (A^i - Ni>,)^ > 1 - e (36) 

^lVn,(l + pn,M)loglogiV - ' ^ ^ 

By (2) the inequality K(A^) > log N + log log N holds for infinitely many N. By 
this property and by (35) for any e > 0 and for any microstate of sufficiently 
large size N chosen as above inequalities (36) and K(w^) > NHmax hold. The 
needed lower bound follows from these inequalities. □ 

Sketch of the proof of Theorem 5. The proof of this theorem is similar 
to the proof of Theorem 4, where in the second part for any e > 0 we can take an 
n-tuple (A^i, . . . , Nn) of the general position which is located in the intersection 
of the layer 






{Nj - Nvj)^ 
Nvi{l + pvi/pi)K{N) 



< 




(37) 



with the simplex and the layer i.e. such that 

K(A^i,...,A^„|Af,K(Af)) > l(n-2)logA^ + 0(l). 

□ 

Proof of Theorem 3. Let us consider an n-tuple (A^i , . . . , Nn) of the general 
position located in the intersection of the layer 



1 _ g < (-^» - Nvi)"^ 




(38) 



with the simplex P)(f and the layer , i.e. such that 

K(Afi,...,A^„|Af,K(A^)) > l(n-2)logA^ + 0(l), 

where E is the corresponding macrostate. In this case, we have by (19) that for 
any microstate G E such that = 0(1) the following inequality holds 

max K(u;^) > A^iJmos — log A^ -I- K(A^) — 0(1). (39) 
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The left - hand part of the relation (20) follows from this inequality for all 
N G 

Assume that the maximum of the Kolmogorov complexity given constraints
(14) is attained on a microstate $\omega^N$. As was proved above, in this case the
inequality

K(w'^) > NH^acc -logN + K(A^) + 0(1) 
holds. By the proof of the first part of Theorem 4 we obtain 

K(iVi, . . . , fV„|iV, K(fV)) - ^(n - 2) logiV = 0(1). (40) 

Applying (19) and taking into account (40) we obtain 

K(w^) < - logiV + K(iV) + 0(1). 

The right-hand inequality of (20) follows from this relation. The theorem is proved.

□ 

Proof of Corollary 2. By 

K(w^) = NHmacc -logN + K(A^) + 0(1) 
and representation (11) it is easy to obtain 

n 

+ pi>i/p,)) = 0(1), 

i=l 

and also (40). Then by (11) we obtain the assertion of the corollary. □ 
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Abstract. Solomonoff’s central result on induction is that the posterior 
of a universal semimeasure $M$ converges rapidly and with probability 1
to the true sequence generating posterior $\mu$, if the latter is computable.
Hence, $M$ is eligible as a universal sequence predictor in case of unknown
$\mu$. Despite some nearby results and proofs in the literature, the
stronger result of convergence for all (Martin-Löf) random sequences
remained open. Such a convergence result would be particularly interesting
and natural, since randomness can be defined in terms of $M$ itself. We
show that there are universal semimeasures $M$ which do not converge
for all random sequences, i.e. we give a partial negative answer to the
open problem. We also provide a positive answer for some non-universal
semimeasures. We define the incomputable measure $D$ as a mixture over
all computable measures and the enumerable semimeasure $W$ as a mixture
over all enumerable nearly-measures. We show that $W$ converges to
$D$ and $D$ to $\mu$ on all random sequences. The Hellinger distance measuring
closeness of two distributions plays a central role.



1 Introduction 

A sequence prediction task is defined as predicting the next symbol $x_n$ from an
observed sequence $x = x_1...x_{n-1}$. The key concept for attacking general prediction
problems is Occam's razor, and to a lesser extent Epicurus' principle of multiple
explanations. The former/latter may be interpreted as keeping the simplest/all
theories consistent with the observations $x_1...x_{n-1}$ and using these theories to
predict $x_n$. Solomonoff [Sol64,Sol78] formalized and combined both principles in
his universal prior $M$, which assigns high/low probability to simple/complex
environments $x$, hence implementing Occam and Epicurus. Formally it is a mixture
of all enumerable semimeasures. An abstract characterization of $M$ by Levin
[ZL70] is that $M$ is a universal enumerable semimeasure in the sense that it
multiplicatively dominates all enumerable semimeasures.

* This work was partially supported by the Swiss National Science Foundation (SNF 
grant 2100-67712.02) and the Russian Foundation for Basic Research (RFBR grants 
N04-01-00427 and N02-01-22001). 
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Solomonoff’s [Sol78] central result is that if the probability of 

observing $x_n$ at time $n$, given past observations $x_1...x_{n-1}$, is a computable
function, then the universal posterior $M_n := M(x_n|x_1...x_{n-1})$ converges (rapidly!)
with $\mu$-probability 1 (w.p.1) for $n \to \infty$ to the true posterior $\mu_n := \mu(x_n|x_1...x_{n-1})$,
hence $M$ represents a universal predictor in case of unknown "true" distribution
$\mu$. Convergence of $M_n$ to $\mu_n$ w.p.1 tells us that $M_n$ is close to $\mu_n$ for
sufficiently large $n$ for almost all sequences $x_1x_2...$. It says nothing about whether
convergence is true for any particular sequence (of measure 0).

Martin-Löf (M.L.) randomness is the standard notion for randomness of
individual sequences [ML66,LV97]. An M.L.-random sequence passes all thinkable
effective randomness tests, e.g. the law of large numbers, the law of the iterated
logarithm, etc. In particular, the set of all $\mu$-random sequences has $\mu$-measure
1. It is natural to ask whether $M_n$ converges to $\mu_n$ (in difference or ratio)
individually for all M.L.-random sequences. Clearly, Solomonoff's result shows that
convergence may at most fail for a set of sequences with $\mu$-measure zero. A
convergence result for M.L.-random sequences would be particularly interesting
and natural in this context, since M.L.-randomness can be defined in terms of
$M$ itself [Lev73]. Despite several attempts to solve this problem [Vov87,VL00,
Hut03b], it remained open [Hut03c].

In this paper we construct an M.L.-random sequence and show the existence
of a universal semimeasure which does not converge on this sequence, hence
answering the open question negatively for some $M$. It remains open whether there
exist (other) universal semimeasures, probably with particularly interesting
additional structure and properties, for which M.L.-convergence holds. The main
positive contribution of this work is the construction of a non-universal
enumerable semimeasure $W$ which M.L.-converges to $\mu$, as desired. As an intermediate
step we consider the incomputable measure $D$, defined as a mixture over all
computable measures. We show posterior M.L.-convergence of $W$ to $D$ and of $D$
to $\mu$. The Hellinger distance measuring closeness of two posterior distributions
plays a central role in this work.

The paper is organized as follows: In Section 2 we give basic notation and
results (for strings, numbers, sets, functions, asymptotics, computability concepts,
prefix Kolmogorov complexity), and define and discuss the concepts of
(universal) (enumerable) (semi)measures. Section 3 summarizes Solomonoff's and Gács'
results on posterior convergence of $M$ to $\mu$ with probability 1. Both results can
be derived from a bound on the expected Hellinger sum. We present an
improved bound on the expected exponentiated Hellinger sum, which implies very
strong assertions on the convergence rate. In Section 4 we investigate whether
convergence for all Martin-Löf random sequences holds. We construct a universal
semimeasure $M$ and a $\mu$-M.L.-random sequence on which $M$ does not converge
to $\mu$ for some computable $\mu$. In Section 5 we present our main positive result.
We derive a finite bound on the Hellinger sum between $\mu$ and $D$, which is
exponential in the randomness deficiency of the sequence and double exponential
in the complexity of $\mu$. This implies that the posterior of $D$ M.L.-converges to
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/i. Finally, in Section 6 we show that W is non-universal and asymptotically 
M.L.-converges to $D$. Section 7 contains discussion and outlook.

2 Notation and Universal Semimeasures M 

Strings. Let $i,k,n,t \in \mathbb{N} = \{1,2,3,...\}$ be natural numbers, $\mathcal{X}^* = \bigcup_{n=0}^\infty \mathcal{X}^n$
be finite strings of symbols over finite alphabet $\mathcal{X} \ni a,b$. We denote strings $x$ of
length $\ell(x) = n$ by $x = x_1x_2...x_n \in \mathcal{X}^n$ with $x_t \in \mathcal{X}$ and further abbreviate
$x_{k:n} := x_kx_{k+1}...x_{n-1}x_n$ for $k \le n$, and $x_{<n} := x_1...x_{n-1}$, and
$\epsilon = x_{<1} = x_{n+1:n} \in \mathcal{X}^0 = \{\epsilon\}$ for the empty string. Let $\omega = x_{1:\infty} \in \mathcal{X}^\infty$ be a generic
and $\alpha \in \mathcal{X}^\infty$ a specific infinite sequence. For a given sequence $x_{1:\infty}$ we say that
$x_t$ is on-sequence and $\bar{x}_t \ne x_t$ is off-sequence. $x'_t$ may be on- or off-sequence.
We identify strings with natural numbers (including zero, $\mathcal{X}^* \cong \mathbb{N} \cup \{0\}$).

Sets and functions. $\mathbb{Q}$, $\mathbb{R}$, $\mathbb{R}_+ := [0,\infty)$ are the sets of fractional, real, and
non-negative real numbers, respectively. $\#S$ denotes the number of elements in
set $S$, ln() the natural and log() the binary logarithm.

Asymptotics. We abbreviate $\lim_{n\to\infty}[f(n) - g(n)] = 0$ by $f(n) \to g(n)$ (for $n \to \infty$) and
say $f$ converges to $g$, without implying that $\lim_{n\to\infty} g(n)$ itself exists. We write
$f(x) \stackrel{\times}{\le} g(x)$ for $f(x) = O(g(x))$ and $f(x) \stackrel{+}{\le} g(x)$ for $f(x) \le g(x) + O(1)$.

Computability. A function $f : S \to \mathbb{R} \cup \{\infty\}$ is said to be enumerable (or lower
semi-computable) if the set $\{(x,y) : y < f(x), x \in S, y \in \mathbb{Q}\}$ is recursively
enumerable. $f$ is co-enumerable (or upper semi-computable) if $[-f]$ is enumerable. $f$ is
computable (or estimable or recursive) if $f$ and $[-f]$ are enumerable. $f$ is
approximable (or limit-computable) if there is a computable function $g : S \times \mathbb{N} \to \mathbb{R}$ with
$\lim_{n\to\infty} g(x,n) = f(x)$. The set of enumerable functions is recursively enumerable.

Complexity. The conditional prefix (Kolmogorov) complexity $K(x|y) :=
\min\{\ell(p) : U(y,p) = x \text{ halts}\}$ is the length of the shortest binary program $p \in \{0,1\}^*$
on a universal prefix Turing machine $U$ with output $x \in \mathcal{X}^*$ and input $y \in \mathcal{X}^*$
[LV97]. $K(x) := K(x|\epsilon)$. For non-string objects $o$ we define $K(o) := K(\langle o \rangle)$,
where $\langle o \rangle \in \mathcal{X}^*$ is some standard code for $o$. In particular, if $(f_i)_{i=1}^\infty$ is an
enumeration of all enumerable functions, we define $K(f_i) = K(i)$. We only
need the following elementary properties: the co-enumerability of $K$, the
upper bounds $K(x|\ell(x)) \stackrel{+}{\le} \ell(x)\log|\mathcal{X}|$ and $K(n) \stackrel{+}{\le} 2\log n$, and $K(x|y) \stackrel{+}{\le} K(x)$,
subadditivity $K(x) \stackrel{+}{\le} K(x,y) \stackrel{+}{\le} K(y) + K(x|y)$, and information non-increase
$K(f(x)) \stackrel{+}{\le} K(x) + K(f)$ for recursive $f : \mathcal{X}^* \to \mathcal{X}^*$.

We need the concepts of (universal) (semi)measures for strings [ZL70]. 

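Definition 1 ((Semi)measures). We call $\nu : \mathcal{X}^* \to [0,1]$ a semimeasure if
$\nu(x) \ge \sum_{a \in \mathcal{X}} \nu(xa)$ $\forall x \in \mathcal{X}^*$, and a (probability) measure if equality holds and
$\nu(\epsilon) = 1$. $\nu(x)$ denotes the $\nu$-probability that a sequence starts with string $x$.
Further, $\nu(a|x) := \nu(xa)/\nu(x)$ is the posterior $\nu$-probability that the next symbol is $a \in \mathcal{X}$,
given sequence $x \in \mathcal{X}^*$.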

Definition 2 (Universal semimeasures M). A semimeasure $M$ is called a
universal element of a class of semimeasures $\mathcal{M}$, if
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M G M and \lv G M. 3wi, > 0 : M{x) > Wu-v{x) 'dx G X* . 

From now on we consider the (in a sense) largest class $\mathcal{M}$ which is relevant from
a constructive point of view (but see [Sch02,Hut03b] for even larger constructive
classes), namely the class of all semimeasures which can be enumerated
(= effectively be approximated) from below:

$\mathcal{M}$ := class of all enumerable semimeasures. (1)

Solomonoff [Sol64, Eq.(7)] defined the universal posterior $M(x|y) = M(xy)/M(y)$
with $M(x)$ defined as the probability that the output of a universal monotone
Turing machine starts with $x$ when provided with fair coin flips on the input tape.
Levin [ZL70] has shown that this $M$ is a universal enumerable semimeasure.
Another possible definition of $M$ is as a (Bayes) mixture [Sol64,ZL70,Sol78,LV97,
Hut03b]: $\tilde{M}(x) = \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)} \nu(x)$, where $K(\nu)$ is the length of the shortest
program computing function $\nu$. Levin [ZL70] has shown that the class of all
enumerable semimeasures is enumerable (with repetitions), hence $\tilde{M}$ is enumerable,
since $K$ is co-enumerable. Hence $\tilde{M} \in \mathcal{M}$, which implies

$M(x) \ge w_{\tilde{M}} \tilde{M}(x) \ge w_{\tilde{M}} 2^{-K(\nu)} \nu(x) = w'_\nu \nu(x)$, where $w'_\nu \stackrel{\times}{=} 2^{-K(\nu)}$. (2)

Up to a multiplicative constant, $M$ assigns higher probability to all $x$ than any
other enumerable semimeasure. All $M$ have the same very slowly decreasing (in
$\nu$) domination constants $w'_\nu$, essentially because $M \in \mathcal{M}$. We drop the prime
from $w'_\nu$ in the following. The mixture definition $\tilde{M}$ immediately generalizes to
arbitrary weighted sums of (semi)measures over other countable classes than $\mathcal{M}$,
but the class may not contain the mixture, and the domination constants may be
rapidly decreasing. We will exploit this for the construction of the non-universal
semimeasure $W$ in Sections 5 and 6.



3 Posterior Convergence with Probability 1 

The following convergence results for M are well-known [Sol78,LV97,Hut03a]. 

Theorem 3 (Convergence of M to $\mu$ w.p.1). For any universal semimeasure
$M$ and any computable measure $\mu$ it holds:

$M(x'_n|x_{<n}) \to \mu(x'_n|x_{<n})$ for any $x'_n$ and $M(x_n|x_{<n})/\mu(x_n|x_{<n}) \to 1$, both w.p.1 for $n \to \infty$.

The first convergence in difference is Solomonoff's [Sol78] celebrated
convergence result. The second convergence in ratio was first derived by
Gács [LV97]. Note the subtle difference between the two convergence
results. For any sequence $x'_{1:\infty}$ (possibly constant and not necessarily random),
$M(x'_n|x_{<n}) - \mu(x'_n|x_{<n})$ converges to zero w.p.1 (referring to $x_{1:\infty}$), but no
statement is possible for $M(x'_n|x_{<n})/\mu(x'_n|x_{<n})$, since $\liminf \mu(x'_n|x_{<n})$ could
be zero. On the other hand, if we stay on-sequence ($x'_{1:\infty} = x_{1:\infty}$), we have







M. Hutter and A. Muchnik 



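$M(x_n|x_{<n})/\mu(x_n|x_{<n}) \to 1$ (whether $\inf \mu(x_n|x_{<n})$ tends to zero or not does not
matter). Indeed, it is easy to give an example where $M(x'_n|x_{<n})/\mu(x'_n|x_{<n})$
diverges. For $\mu(1|x_{<n}) = 1 - \mu(0|x_{<n}) = \frac{1}{2}n^{-3}$ we get
$\mu(0_{1:n}) = \prod_{t=1}^n (1 - \frac{1}{2}t^{-3}) \to c = 0.450... > 0$, i.e. $0_{1:\infty}$ is $\mu$-random.
On the other hand, one can show that $M(0_{<n}) = O(1)$ and $M(0_{<n}1) \stackrel{\times}{=} 2^{-K(n)}$, which implies
$M(1|0_{<n})/\mu(1|0_{<n}) \stackrel{\times}{=} n^3 \cdot 2^{-K(n)} \stackrel{\times}{\ge} n \to \infty$ for $n \to \infty$ ($K(n) \stackrel{+}{\le} 2\log n$).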
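
The limit constant in this example can be checked numerically. The following Python sketch assumes that the example reads $\mu(1|x_{<n}) = \frac{1}{2}n^{-3}$ and truncates the infinite product at a finite horizon; the function name `prob_all_zeros` is illustrative and not taken from the paper.

```python
import math


def prob_all_zeros(n):
    """mu(0_{1:n}) = prod_{t=1}^{n} (1 - t**-3 / 2).

    Assumes mu(1 | x_<t) = t**-3 / 2, as in the example above.
    The product is accumulated in log space for numerical stability.
    """
    log_p = sum(math.log1p(-0.5 * t ** -3) for t in range(1, n + 1))
    return math.exp(log_p)


if __name__ == "__main__":
    for n in (1, 10, 1000, 100000):
        print(n, prob_all_zeros(n))
```

The printed values decrease toward a strictly positive limit of roughly 0.45, so the all-zero sequence keeps positive $\mu$-probability, which is what makes it $\mu$-random.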

Theorem 3 follows from (the discussion after) Lemma 4 due to $M(x) \ge
w_\mu \mu(x)$. Actually the Lemma strengthens and generalizes Theorem 3. In the
following we denote expectations w.r.t. measure $\rho$ by $E_\rho$, i.e. for a function
$f : \mathcal{X}^n \to \mathbb{R}$, $E_\rho[f] = \sum'_{x_{1:n}} \rho(x_{1:n}) f(x_{1:n})$, where $\sum'$ sums over all $x_{1:n}$ for which
$\rho(x_{1:n}) \ne 0$. Using $\sum'$ instead of $\sum$ is important for partial functions $f$ undefined
on a set of $\rho$-measure zero. Similarly $P_\rho$ denotes the $\rho$-probability.

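Lemma 4 (Expected Bounds on Hellinger Sum). Let $\mu$ be a measure and
$\nu$ be a semimeasure with $\nu(x) \ge w \cdot \mu(x)$ $\forall x$. Then the following bounds on the
Hellinger distance $h_t(\nu,\mu|\omega_{<t}) := \sum_{a \in \mathcal{X}} (\sqrt{\nu(a|\omega_{<t})} - \sqrt{\mu(a|\omega_{<t})})^2$ hold:

$$\sum_{t=1}^\infty E\Big[\Big(\sqrt{\tfrac{\nu(\omega_t|\omega_{<t})}{\mu(\omega_t|\omega_{<t})}} - 1\Big)^2\Big]
\stackrel{(i)}{\le} \sum_{t=1}^\infty E[h_t]
\stackrel{(ii)}{\le} 2\ln\Big\{E\Big[\exp\Big(\tfrac{1}{2}\sum_{t=1}^\infty h_t\Big)\Big]\Big\}
\stackrel{(iii)}{\le} \ln w^{-1}$$

where E means expectation w.r.t. $\mu$.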

The $\ln w^{-1}$-bounds on the first and second expression were first
derived in [Hut03a], the second being a variation of Solomonoff's bound
$\sum_n E[(\nu(0|x_{<n}) - \mu(0|x_{<n}))^2] \le \frac{1}{2}\ln w^{-1}$. If sequence $x_1x_2...$ is sampled from the
probability measure $\mu$, these bounds imply

$\nu(x'_n|x_{<n}) \to \mu(x'_n|x_{<n})$ for any $x'_n$ and $\nu(x_n|x_{<n})/\mu(x_n|x_{<n}) \to 1$, both w.p.1 for $n \to \infty$,

where w.p.1 stands here and in the following for 'with $\mu$-probability 1'.

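Convergence is "fast" in the following sense: The second bound ($\sum_t E[h_t] \le
\ln w^{-1}$) implies that the expected number of times $t$ in which $h_t \ge \varepsilon$ is finite and
bounded by $\frac{1}{\varepsilon}\ln w^{-1}$. The new third bound represents a significant improvement.
It implies by means of a Markov inequality that the probability of even only
marginally exceeding this number is extremely small, and that $\sum_t h_t$ is very
unlikely to exceed $\ln w^{-1}$ by much. More precisely:

$$P\big[\#\{t : h_t \ge \varepsilon\} \ge \tfrac{1}{\varepsilon}(\ln w^{-1} + c)\big] \le P\big[\textstyle\sum_t h_t \ge \ln w^{-1} + c\big]$$
$$= P\big[\exp(\tfrac{1}{2}\textstyle\sum_t h_t) \ge e^{c/2} w^{-1/2}\big] \le \sqrt{w}\, E\big[\exp(\tfrac{1}{2}\textstyle\sum_t h_t)\big]\, e^{-c/2} \le e^{-c/2}.$$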
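
The chain of bounds in Lemma 4 can be checked exactly on a small finite example. The sketch below is an illustration under stated assumptions, not the paper's construction: it takes $\mu$ to be a Bernoulli(0.3) measure, $\nu$ an equal-weight Bayes mixture of Bernoulli(0.3) and Bernoulli(0.7) (so $w = 1/2$), and truncates all sums at a finite horizon $n$. The function names are hypothetical.

```python
import math
from itertools import product


def bernoulli_prob(x, theta):
    """Probability of the binary tuple x under i.i.d. Bernoulli(theta)."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def mixture_prob(x, thetas, weights):
    """nu(x) = sum_i weights[i] * Bernoulli(thetas[i])(x), a Bayes mixture."""
    return sum(w * bernoulli_prob(x, th) for th, w in zip(thetas, weights))


def hellinger(p, q):
    """h(p, q) = sum_a (sqrt(p_a) - sqrt(q_a))**2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def lemma4_quantities(theta_true, thetas, weights, n):
    """Return (sum_t E[h_t], 2 ln E[exp(1/2 sum_t h_t)]) for t = 1..n.

    Expectations are taken w.r.t. mu = Bernoulli(theta_true) by exact
    enumeration of all 2**n binary strings of length n.
    """
    expected_sum = 0.0
    expected_exp = 0.0
    mu_post = (1.0 - theta_true, theta_true)  # mu(0|.), mu(1|.)
    for x in product((0, 1), repeat=n):
        total = 0.0
        for t in range(n):
            prefix = x[:t]
            nu_prefix = mixture_prob(prefix, thetas, weights)
            nu_post = tuple(
                mixture_prob(prefix + (a,), thetas, weights) / nu_prefix
                for a in (0, 1)
            )
            total += hellinger(nu_post, mu_post)
        mu_x = bernoulli_prob(x, theta_true)
        expected_sum += mu_x * total
        expected_exp += mu_x * math.exp(0.5 * total)
    return expected_sum, 2.0 * math.log(expected_exp)


if __name__ == "__main__":
    s, e = lemma4_quantities(0.3, (0.3, 0.7), (0.5, 0.5), 10)
    print(s, e, math.log(2.0))  # expect s <= e <= ln(1/w) = ln 2
```

Exact enumeration is exponential in $n$, so this is only practical for short horizons; it serves as a sanity check of inequalities (ii) and (iii), not as a proof.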



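Proof. We use the abbreviations $\rho_t = \rho(x_t|x_{<t})$ and $\rho_{1:n} = \rho_1 \cdot ... \cdot \rho_n = \rho(x_{1:n})$ for
$\rho \in \{\mu,\nu,R,N,...\}$ and $h_t = \sum_{x_t} (\sqrt{\nu_t} - \sqrt{\mu_t})^2$.

(i) follows from

$$E\Big[\Big(\sqrt{\tfrac{\nu_t}{\mu_t}} - 1\Big)^2 \Big| x_{<t}\Big] \equiv \sum_{x_t : \mu_t \ne 0} \mu_t \Big(\sqrt{\tfrac{\nu_t}{\mu_t}} - 1\Big)^2 = \sum_{x_t : \mu_t \ne 0} (\sqrt{\nu_t} - \sqrt{\mu_t})^2 \le h_t$$

by taking the expectation E[] and sum $\sum_{t=1}^\infty$.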



Universal Convergence of Semimeasures on Individual Random Sequences 






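(ii) follows from Jensen's inequality $\exp(E[f]) \le E[\exp(f)]$ for $f = \frac{1}{2}\sum_t h_t$.

(iii) We exploit a construction used in [Vov87, Thm.1]. For discrete
(semi)measures $p$ and $q$ with $\sum_i p_i = 1$ and $\sum_i q_i \le 1$ it holds:

$$\sum_i \sqrt{p_i q_i} \le 1 - \tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2 \le \exp\Big[-\tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2\Big]. \quad (3)$$

The first inequality is obvious after multiplying out the second expression. The
second inequality follows from $1 - x \le e^{-x}$. Vovk [Vov87] defined a measure
$R_t := \sqrt{\mu_t \nu_t}/N_t$ with normalization $N_t := \sum_{x_t} \sqrt{\mu_t \nu_t}$. Applying (3) for measure $\mu$ and
semimeasure $\nu$ we get $N_t \le \exp(-\frac{1}{2}h_t)$. Together with $\nu(x) \ge w \cdot \mu(x)$ $\forall x$ this
implies

$$\prod_{t=1}^n R_t = \prod_{t=1}^n \frac{\sqrt{\mu_t \nu_t}}{N_t} = \frac{\sqrt{\mu_{1:n}\nu_{1:n}}}{N_{1:n}} = \mu_{1:n} \sqrt{\frac{\nu_{1:n}}{\mu_{1:n}}}\, N_{1:n}^{-1} \ge \mu_{1:n} \sqrt{w} \exp\Big(\tfrac{1}{2}\sum_{t=1}^n h_t\Big).$$

Summing over $x_{1:n}$ and exploiting $\sum_{x_t} R_t = 1$ we get $1 \ge \sqrt{w}\, E[\exp(\frac{1}{2}\sum_t h_t)]$,
which proves (iii).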

The bound and proof may be generalized to 1 > w"’E[exp(i^j^^^(r'” — 
with 0<K< 5 by defining Rt = p\~'^Vt /Nt with Nt = J2xtPt~'^’^t and 
exploiting ° 

One can show that the constant $\frac{1}{2}$ in Lemma 4 can essentially not be
improved. Increasing it to a constant $\alpha > 1$ makes the expression infinite for some
(Bernoulli) distribution $\mu$ (however we choose $\nu$). For $\nu = M$ the expression can
become already infinite for $\alpha > \frac{1}{2}$ and some computable measure $\mu$.



4 Non-convergence in Martin-Löf Sense

Convergence of $M(x_n|x_{<n})$ to $\mu(x_n|x_{<n})$ with $\mu$-probability 1 tells us that
$M(x_n|x_{<n})$ is close to $\mu(x_n|x_{<n})$ for sufficiently large $n$ on "most" sequences
$x_{1:\infty}$. It says nothing about whether convergence is true for any particular sequence (of
measure 0). Martin-Löf randomness can be used to capture convergence properties
for individual sequences. Martin-Löf randomness is a very important concept
of randomness of individual sequences, which is closely related to Kolmogorov
complexity and Solomonoff's universal semimeasure $M$. Levin gave a
characterization equivalent to Martin-Löf's original definition [Lev73]:

Definition 5 (Martin-Löf random sequences). A sequence $\omega = \omega_{1:\infty}$ is
$\mu$-Martin-Löf random ($\mu$.M.L.) iff there is a constant $c < \infty$ such that
$M(\omega_{1:n}) \le c \cdot \mu(\omega_{1:n})$ for all $n$. Moreover, $d_\mu(\omega) := \sup_n \{\log \frac{M(\omega_{1:n})}{\mu(\omega_{1:n})}\} \le \log c$ is called the
randomness deficiency of $\omega$.

One can show that an M.L.-random sequence $x_{1:\infty}$ passes all thinkable effective
randomness tests, e.g. the law of large numbers, the law of the iterated logarithm,
etc. In particular, the set of all $\mu$.M.L.-random sequences has $\mu$-measure 1.
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The open question we study in this section is whether M converges to /r (in 
difference or ratio) individually for all Martin-Löf random sequences. Clearly,
Theorem 3 implies that $\mu$.M.L.-convergence may at most fail for a set of sequences
with $\mu$-measure zero. An M.L.-convergence result would be particularly interesting
and natural for $M$, since M.L.-randomness can be defined in terms of $M$ itself
(Definition 5).

The state of the art regarding this problem may be summarized as follows:
[Vov87] contains a (non-improvable?) result which is slightly too weak to imply
M.L.-convergence, [LV97, Thm.5.2.2] and [VL00, Thm.10] contain an erroneous
proof for M.L.-convergence, and [Hut03b] proves a theorem indicating that the
answer may be hard and subtle (see [Hut03b] for details).

The main contribution of this section is a partial answer to this question. We
show that M.L.-convergence fails at least for some universal semimeasures:

Theorem 6 (Universal semimeasure non-convergence). There exists a
universal semimeasure $M$ and a computable measure $\mu$, and a $\mu$.M.L.-random
sequence $\alpha$, such that

$M(\alpha_n|\alpha_{<n}) \not\to \mu(\alpha_n|\alpha_{<n})$ for $n \to \infty$.

This implies that also $M_n/\mu_n$ does not converge (since $\mu_n \le 1$ is bounded). We
do not know whether Theorem 6 holds for all universal semimeasures. The proof
idea is to construct an enumerable (semi)measure $\nu$ such that $\nu$ dominates $M$
on some $\mu$-random sequence $\alpha$, but $\nu(\alpha_n|\alpha_{<n}) \not\to \mu(\alpha_n|\alpha_{<n})$. Then we mix $M$ to
$\nu$ to make $\nu$ universal, but with larger contribution from $\nu$, in order to preserve
non-convergence. There is also a non-constructive proof showing that an arbitrarily
small contamination with $\nu$ can lead to non-convergence. We only present the
constructive proof.

Proof. We consider binary alphabet $\mathcal{X} = \{0,1\}$ only. Let $\mu(x) = \lambda(x) := 2^{-\ell(x)}$ be
the uniform measure. We define the sequence $\alpha$ as the (in a sense) lexicographically
first (or equivalently left-most in the tree of sequences) $\lambda$.M.L.-random
sequence. Formally we define $\alpha$ inductively in $n = 1,2,3,...$ by

$\alpha_n = 0$ if $M(\alpha_{<n}0) \le 2^{-n}$, and $\alpha_n = 1$ else. (4)

We know that $M(\epsilon) \le 1$ and $M(\alpha_{<n}0) \le 2^{-n}$ if $\alpha_n = 0$. Inductively, assuming
$M(\alpha_{<n}) \le 2^{-n+1}$ for $\alpha_n = 1$ we have $2^{-n+1} \ge M(\alpha_{<n}) \ge M(\alpha_{<n}0) + M(\alpha_{<n}1) >
2^{-n} + M(\alpha_{<n}1)$ since $M$ is a semimeasure, hence $M(\alpha_{<n}1) < 2^{-n}$. Hence

$M(\alpha_{1:n}) \le 2^{-n} = \lambda(\alpha_{1:n})$ $\forall n$, i.e. $\alpha$ is $\lambda$.M.L.-random. (5)
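
The selection rule (4) can be illustrated on a toy example. Since the universal semimeasure $M$ is not computable, the Python sketch below substitutes a simple computable measure, `toy_M`, an equal mixture of the uniform measure and the point measure on $000...$; this stand-in, the function names, and the reading of (4) with threshold $2^{-n}$ are assumptions made only for illustration.

```python
from fractions import Fraction


def toy_M(x):
    """Toy stand-in for M (an assumption for illustration only): equal
    mixture of the uniform measure and the point measure on 000..."""
    uniform = Fraction(1, 2 ** len(x))
    point = Fraction(1) if all(b == 0 for b in x) else Fraction(0)
    return (uniform + point) / 2


def leftmost_sequence(M, n_max):
    """Follow rule (4): alpha_n = 0 if M(alpha_<n 0) <= 2**-n, else 1."""
    alpha = ()
    for n in range(1, n_max + 1):
        bit = 0 if M(alpha + (0,)) <= Fraction(1, 2 ** n) else 1
        alpha += (bit,)
    return alpha


if __name__ == "__main__":
    alpha = leftmost_sequence(toy_M, 6)
    print(alpha)
    # Invariant (5): M(alpha_{1:n}) <= 2**-n for every prefix.
    print(all(toy_M(alpha[:n]) <= Fraction(1, 2 ** n) for n in range(1, 7)))
```

In this toy setting the rule steers away from the heavy sequence $000...$ at the first step and then always takes the left branch, so every prefix keeps weight at most $2^{-n}$.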

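Let $M^t$ with $t = 1,2,3,...$ be computable approximations of $M$, which enumerate
$M$, i.e. $M^t(x) \nearrow M(x)$ for $t \to \infty$. We define $\alpha^t$ like $\alpha$ but with $M$ replaced by
$M^t$ in the definition. $M^t \nearrow M$ implies $\alpha^t \nearrow \alpha$ (lexicographically increasing). We
define an enumerable semimeasure $\nu$ as follows:

$$\nu^t(x) := \begin{cases}
2^{-t} & \text{if } \ell(x) = t \text{ and } x < \alpha^t_{1:t} \\
0 & \text{if } \ell(x) = t \text{ and } x \ge \alpha^t_{1:t} \\
0 & \text{if } \ell(x) > t \\
\nu^t(x0) + \nu^t(x1) & \text{if } \ell(x) < t
\end{cases} \quad (6)$$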
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where < is the lexicographical ordering on sequences, v* is a semimeasure, and 
with $\alpha^t$ also $\nu^t$ is computable and monotone increasing in $t$, hence $\nu := \lim_{t\to\infty} \nu^t$
is an enumerable semimeasure (indeed, $\nu$ is a measure). We could have defined
a $\nu_{tn}$ by replacing $\alpha^t_{1:t}$ with $\alpha^n_{1:t}$ in (6). Since $\nu_{tn}$ is monotone increasing in $t$
and $n$, any order of $t,n \to \infty$ leads to $\nu$, so we have chosen arbitrarily $t = n$. By
induction (starting from $\ell(x) = t$) it follows that

$\nu^t(x) = 2^{-\ell(x)}$ if $x < \alpha^t_{1:\ell(x)}$ and $\ell(x) \le t$, $\nu^t(x) = 0$ if $x > \alpha^t_{1:\ell(x)}$.

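On-sequence, i.e. for $x = \alpha_{1:n}$, $\nu^t$ is somewhere in-between 0 and $2^{-\ell(x)}$. Since
sequence $\alpha := \lim_t \alpha^t$ is $\lambda$.M.L.-random it contains 01 infinitely often, actually
$\alpha_n\alpha_{n+1} = 01$ for a non-vanishing fraction of $n$. In the following we fix such an $n$.
For $t \ge n$ we get

$$\nu^t(\alpha_{<n}) = \nu^t(\alpha_{<n}0) + \nu^t(\alpha_{<n}1) = \nu^t(\alpha_{<n}0) = \nu^t(\alpha_{1:n}) \;\Rightarrow\; \nu(\alpha_{<n}) = \nu(\alpha_{1:n}),$$

where the second equality holds since $\alpha_{<n}1 > \alpha_{1:n}$ and the third since $\alpha_n = 0$.
This ensures $\nu(\alpha_n|\alpha_{<n}) = 1 \ne \frac{1}{2} = \lambda_n$. For $t > n$ large enough such that
$\alpha^t_{1:n+1} = \alpha_{1:n+1}$ we get:

$$\nu^t(\alpha_{1:n}) \ge \nu^t(\alpha_{1:n}0) = 2^{-n-1} \;\Rightarrow\; \nu(\alpha_{1:n}) \ge 2^{-n-1},$$

where the equality holds since $\alpha_{1:n}0 < \alpha^t_{1:n+1}$ (because $\alpha_{n+1} = 1$).
This ensures $\nu(\alpha_{1:n}) \ge 2^{-n-1} \ge \frac{1}{2}M(\alpha_{1:n})$ by (5). Let $M$ be any universal
semimeasure and $0 < \gamma < \frac{1}{5}$. Then $M'(x) := (1-\gamma)\nu(x) + \gamma M(x)$ $\forall x$ is also a
universal semimeasure with

$$M'(\alpha_n|\alpha_{<n}) = \frac{(1-\gamma)\nu(\alpha_{1:n}) + \gamma M(\alpha_{1:n})}{(1-\gamma)\nu(\alpha_{<n}) + \gamma M(\alpha_{<n})}
\ge \frac{(1-\gamma)\nu(\alpha_{1:n})}{(1-\gamma)\nu(\alpha_{1:n}) + \gamma 2^{-n+1}}$$
$$= \frac{1-\gamma}{1 - \gamma + \gamma 2^{-n+1}/\nu(\alpha_{1:n})} \ge \frac{1-\gamma}{1+3\gamma} > \frac{1}{2}.$$

The first inequality uses $M(\alpha_{<n}) \le 2^{-n+1}$, $M(\alpha_{1:n}) \ge 0$ and $\nu(\alpha_{<n}) = \nu(\alpha_{1:n})$;
the second uses $\nu(\alpha_{1:n}) \ge 2^{-n-1}$; the last uses $\gamma < \frac{1}{5}$.
For instance for $\gamma = \frac{1}{9}$ we have $M'(\alpha_n|\alpha_{<n}) \ge \frac{2}{3} \ne \frac{1}{2} = \lambda(\alpha_n|\alpha_{<n})$ for a
non-vanishing fraction of $n$'s. □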

A converse of Theorem 6 can also be shown: 



Theorem 7 (Convergence on non-random sequences). For every universal
semimeasure $M$ there exist computable measures $\mu$ and non-$\mu$.M.L.-random
sequences $\alpha$ for which $M(\alpha_n|\alpha_{<n})/\mu(\alpha_n|\alpha_{<n}) \to 1$.
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5 Convergence in Martin-L6f Sense 

In this and the next section we give a positive answer to the question of posterior
M.L.-convergence to $\mu$. We consider general finite alphabet $\mathcal{X}$.

Theorem 8 (Universal predictor for M.L.-random sequences). There
exists an enumerable semimeasure $W$ such that for every computable measure $\mu$
and every $\mu$.M.L.-random sequence $\omega$, the posteriors converge to each other:

$W(a|\omega_{<t}) \to \mu(a|\omega_{<t})$ for $t \to \infty$, for all $a \in \mathcal{X}$ if $d_\mu(\omega) < \infty$.

The semimeasure $W$ we will construct is not universal in the sense of
dominating all enumerable semimeasures, unlike $M$. Normalizing $W$ shows that there
is also a measure whose posterior converges to $\mu$, but this measure is not
enumerable, only approximable. For proving Theorem 8 we first define an intermediate
measure $D$ as a mixture over all computable measures, which is not even
approximable. Based on Lemmas 4, 9 and 10, Proposition 11 shows that $D$ M.L.-converges to
$\mu$. We then define the concept of quasimeasures and an enumerable semimeasure
$W$ as a mixture over all enumerable quasimeasures. Proposition 12 shows that
$W$ M.L.-converges to $D$. Theorem 8 immediately follows from Propositions 11
and 12.

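Lemma 9 (Hellinger Chain). Let $h(p,q) := \sum_{i=1}^N (\sqrt{p_i} - \sqrt{q_i})^2$ be the Hellinger
distance between $p = (p_i)_{i=1}^N \in \mathbb{R}_+^N$ and $q = (q_i)_{i=1}^N \in \mathbb{R}_+^N$. Then

i) for $p,q,r \in \mathbb{R}_+^N$: $h(p,q) \le (1+\beta)\,h(p,r) + (1+\beta^{-1})\,h(r,q)$, any $\beta > 0$

ii) for $p^1,...,p^m \in \mathbb{R}_+^N$: $h(p^1,p^m) \le 3\sum_{k=2}^m k^2\, h(p^{k-1},p^k)$

Proof. (i) For any $x,y \in \mathbb{R}$ and $\beta > 0$ we have $(x+y)^2 \le (1+\beta)x^2 + (1+\beta^{-1})y^2$.
Inserting $x = \sqrt{p_i} - \sqrt{r_i}$ and $y = \sqrt{r_i} - \sqrt{q_i}$ and summing over $i$ proves (i).

(ii) Apply (i) for the triples $(p,r,q) = (p^k,p^{k+1},p^m)$ for and in order of $k = 1,2,...,m-2$
with $\beta = \beta_k = k(k+1)$ and finally use $\prod_{j=1}^{k}(1+\beta_j^{-1}) \le 3$. □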
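
Both parts of Lemma 9 are easy to spot-check numerically. The following Python sketch assumes the readings $h(p,q) \le (1+\beta)h(p,r) + (1+\beta^{-1})h(r,q)$ for part (i) and $h(p^1,p^m) \le 3\sum_{k=2}^m k^2 h(p^{k-1},p^k)$ for part (ii), and evaluates both sides on random non-negative vectors; the helper names are hypothetical.

```python
import math
import random


def hellinger(p, q):
    """h(p, q) = sum_i (sqrt(p_i) - sqrt(q_i))**2 for non-negative vectors."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def pairwise_bound(p, q, r, beta):
    """Right-hand side of part (i): (1+beta) h(p,r) + (1+1/beta) h(r,q)."""
    return (1 + beta) * hellinger(p, r) + (1 + 1 / beta) * hellinger(r, q)


def chain_bound(points):
    """Right-hand side of part (ii) for points = [p^1, ..., p^m]:
    3 * sum_{k=2}^{m} k**2 * h(p^{k-1}, p^k)."""
    return 3 * sum(
        k * k * hellinger(points[k - 2], points[k - 1])
        for k in range(2, len(points) + 1)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    points = [[rng.random() for _ in range(4)] for _ in range(6)]
    print(hellinger(points[0], points[-1]), "<=", chain_bound(points))
```

A random check of this kind cannot replace the proof, but it helps catch index slips such as an off-by-one in the weights $k^2$.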

We need a way to convert expected bounds to bounds on individual M.L.
random sequences, sort of a converse of "M.L. implies w.p.1". Consider for
instance the Hellinger sum $H(\omega) := \sum_{t=1}^\infty h_t(\mu,\rho)/\ln w^{-1}$ between two computable
measures $\rho \ge w \cdot \mu$. Then $H$ is an enumerable function and Lemma 4 implies
$E[H] \le 1$, hence $H$ is an integral $\mu$-test. $H$ can be increased to an enumerable
$\mu$-submartingale $\bar{H}$. The universal $\mu$-submartingale $M/\mu$ multiplicatively
dominates all enumerable submartingales (and hence $\bar{H}$). Since $M/\mu \le 2^{d_\mu(\omega)}$, this
implies the desired bound $H(\omega) \stackrel{\times}{\le} 2^{d_\mu(\omega)}$ for individual $\omega$. We give a self-contained
direct proof, explicating all important constants.

Lemma 10 (Expected to Individual Bound). Let $F(\omega) \ge 0$ be an enumerable
function and $\mu$ be an enumerable measure and $\varepsilon > 0$ be co-enumerable. Then:
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If E^[F] < e then F{lo) ^ Vw 

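where $d_\mu(\omega)$ is the $\mu$-randomness deficiency of $\omega$ and $K(\mu,F,1/\varepsilon)$ is the length
of the shortest program for $\mu$, $F$, and $\varepsilon$.

Lemma 10 roughly says that for $\mu$, $F$, and $\varepsilon \stackrel{\times}{=} E_\mu[F]$ with short program
($K(\mu,F,1/\varepsilon) = O(1)$) and $\mu$-random $\omega$ ($d_\mu(\omega) = O(1)$) we have $F(\omega) \stackrel{\times}{\le} E_\mu[F]$.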

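Proof. Let $F(\omega) = \lim_{n\to\infty} F_n(\omega) = \sup_n F_n(\omega)$ be enumerated by an increasing
sequence of computable functions $F_n(\omega)$. $F_n(\omega)$ can be chosen to depend on $\omega_{1:n}$
only, i.e. $F_n(\omega) = F_n(\omega_{1:n})$ is independent of $\omega_{n+1:\infty}$. Let $\varepsilon_n \searrow \varepsilon$ co-enumerate $\varepsilon$.
We define

$\bar\mu_n(\omega_{1:k}) := \varepsilon_n^{-1} \sum_{\omega_{k+1:n}} \mu(\omega_{1:n}) F_n(\omega_{1:n})$ for $k \le n$, and $\bar\mu_n(\omega_{1:k}) = 0$ for $k > n$.

$\bar\mu_n$ is a computable semimeasure for each $n$ (due to $E_\mu[F_n] \le \varepsilon \le \varepsilon_n$) and increasing
in $n$, since

$\bar\mu_n(\omega_{1:k}) \ge 0 = \bar\mu_{n-1}(\omega_{1:k})$ for $k \ge n$ and

$$\bar\mu_n(\omega_{<n}) \ge \sum_{\omega_n} \varepsilon_n^{-1} \mu(\omega_{1:n}) F_{n-1}(\omega_{<n}) = \varepsilon_n^{-1} \mu(\omega_{<n}) F_{n-1}(\omega_{<n}) \ge \bar\mu_{n-1}(\omega_{<n}),$$

where the three steps use $F_n \ge F_{n-1}$, that $\mu$ is a measure, and $\varepsilon_n \le \varepsilon_{n-1}$, respectively,
and similarly for $k < n-1$. Hence $\bar\mu := \bar\mu_\infty$ is an enumerable semimeasure (indeed
$\bar\mu$ is proportional to a measure). From dominance (2) we get

$$M(\omega_{1:n}) \stackrel{\times}{\ge} 2^{-K(\bar\mu)} \bar\mu(\omega_{1:n}) \ge 2^{-K(\bar\mu)} \bar\mu_n(\omega_{1:n}) = 2^{-K(\bar\mu)} \varepsilon_n^{-1} \mu(\omega_{1:n}) F_n(\omega_{1:n}). \quad (7)$$

In order to enumerate $\bar\mu$, we need to enumerate $\mu$, $F$, and $\varepsilon^{-1}$, hence
$K(\bar\mu) \stackrel{+}{\le} K(\mu,F,1/\varepsilon)$, so we get

$$F_n(\omega) \equiv F_n(\omega_{1:n}) \stackrel{\times}{\le} \varepsilon_n \cdot 2^{K(\mu,F,1/\varepsilon)} \cdot \frac{M(\omega_{1:n})}{\mu(\omega_{1:n})} \le \varepsilon_n \cdot 2^{K(\mu,F,1/\varepsilon) + d_\mu(\omega)}.$$

Taking the limit $F_n \nearrow F$ and $\varepsilon_n \searrow \varepsilon$ completes the proof. □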

Let $\mathcal{M} = \{\nu_1,\nu_2,...\}$ be an enumeration of all enumerable semimeasures, $J_k :=
\{i \le k : \nu_i \text{ is measure}\}$, and $\delta_k(x) := \sum_{i \in J_k} \varepsilon_i \nu_i(x)$. The weights $\varepsilon_i$ need to be
computable and exponentially decreasing in $i$ and $\sum_{i=1}^\infty \varepsilon_i \le 1$. We choose $\varepsilon_i = i^{-6}2^{-i}$.
Note the subtle and important fact that although the definition of $J_k$ is
non-constructive, as a finite set of finite objects, $J_k$ is decidable (the program
is unknowable for large $k$). Hence, $\delta_k$ is computable, since enumerable measures
are computable.

$D(x) = \delta_\infty(x) = \sum_{i \in J_\infty} \varepsilon_i \nu_i(x)$ = mixture of all computable measures.

In contrast to $J_k$ and $\delta_k$, the set $J_\infty$ and hence $D$ are neither enumerable nor
co-enumerable. We also define the measures $\hat\delta_k(x) := \delta_k(x)/\delta_k(\epsilon)$ and $\hat{D}(x) :=
D(x)/D(\epsilon)$. The following Proposition implies posterior convergence of $D$ to $\mu$
on $\mu$-random sequences.
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Proposition 11 (Convergence of incomputable measure D). Let fi be 

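a computable measure with index $k_0$, i.e. $\mu = \nu_{k_0}$. Then for the incomputable
measure $\hat{D}$ and the computable but non-constructive measures $\hat\delta_{k_0}$ defined above,
the following holds:

i) $\sum_{t=1}^\infty h_t(\hat\delta_{k_0},\mu) \stackrel{+}{\le} 2\ln 2 \cdot d_\mu(\omega) + 3k_0$

ii) $\sum_{t=1}^\infty h_t(\hat\delta_{k_0},\hat{D}) \stackrel{\times}{\le} k_0^7\, 2^{k_0 + d_\mu(\omega)}$

Combining (i) and (ii), using Lemma 9(i), we get $\sum_{t=1}^\infty h_t(\mu,\hat{D}) \le c_\omega f(k_0) < \infty$
for $\mu$-random $\omega$, which implies $D(b|\omega_{<t}) \equiv \hat{D}(b|\omega_{<t}) \to \mu(b|\omega_{<t})$. We do not know
whether on-sequence convergence of the ratio holds. Similar bounds hold for $\hat\delta_{k_1}$
instead of $\hat\delta_{k_0}$, $k_1 \ge k_0$. The principal proof idea is to convert the expected bounds
of Lemma 4 to individual bounds, using Lemma 10. The problem is that $\hat{D}$
is not computable, which we circumvent by joining, with Lemma 9, bounds on
$\sum_t h_t(\hat\delta_{k-1},\hat\delta_k)$ for $k = k_0, k_0+1, ...$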

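Proof. (i) Let $H(\omega) := \sum_{t=1}^\infty h_t(\hat\delta_{k_0},\mu)$. $\mu$ and $\hat\delta_{k_0}$ are measures with $\hat\delta_{k_0} \ge
\delta_{k_0} \ge \varepsilon_{k_0}\mu$, since $\delta_k(\epsilon) \le 1$, $\mu = \nu_{k_0}$ and $k_0 \in J_{k_0}$. Hence, Lemma 4
applies and shows $E_\mu[\exp(\frac{1}{2}H)] \le \varepsilon_{k_0}^{-1/2}$. $H$ is well-defined and enumerable for
$d_\mu(\omega) < \infty$, since $d_\mu(\omega) < \infty$ implies $\mu(\omega_{1:t}) \ne 0$ implies $\hat\delta_{k_0}(\omega_{1:t}) \ne 0$. So
$\mu(b|\omega_{1:t})$ and $\hat\delta_{k_0}(b|\omega_{1:t})$ are well defined and computable (given $J_{k_0}$). Hence
$h_t(\hat\delta_{k_0},\mu)$ is computable, hence $H(\omega)$ is enumerable. Lemma 10 then implies
$\exp(\frac{1}{2}H(\omega)) \stackrel{\times}{\le} \varepsilon_{k_0}^{-1/2} \cdot 2^{K(\mu,H,\sqrt{\varepsilon_{k_0}}) + d_\mu(\omega)}$. We bound

$K(\mu,H,\sqrt{\varepsilon_{k_0}}) \stackrel{+}{\le} K(H|\mu,k_0) + K(k_0) \stackrel{+}{\le} K(J_{k_0}|k_0) + K(k_0) \stackrel{+}{\le} k_0 + 2\log k_0.$

The first inequality holds, since $k_0$ is the index and hence a description of $\mu$, and
$\varepsilon_{(\cdot)}$ is a simple computable function. $H$ can be computed from $\mu$, $k_0$ and $J_{k_0}$, which
implies the second inequality. The last inequality follows from $K(k_0) \stackrel{+}{\le} 2\log k_0$
and the fact that for each $i \le k_0$ one bit suffices to specify (non)membership to
$J_{k_0}$, i.e. $K(J_{k_0}|k_0) \stackrel{+}{\le} k_0$. Putting everything together we get

$H(\omega) \stackrel{+}{\le} \ln\varepsilon_{k_0}^{-1} + [k_0 + 2\log k_0 + d_\mu(\omega)]\,2\ln 2 \stackrel{+}{\le} (2\ln 2)\,d_\mu(\omega) + 3k_0.$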

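(ii) Let $H^k(\omega) := \sum_{t=1}^\infty h_t(\hat\delta_k,\hat\delta_{k-1})$ and $k > k_0$. $\delta_{k-1} \le \delta_k$ implies

$$\frac{\hat\delta_{k-1}(x)}{\hat\delta_k(x)} \le \frac{\delta_k(\epsilon)}{\delta_{k-1}(\epsilon)} \le \frac{\delta_{k-1}(\epsilon) + \varepsilon_k}{\delta_{k-1}(\epsilon)} = 1 + \frac{\varepsilon_k}{\delta_{k-1}(\epsilon)} \le 1 + \frac{\varepsilon_k}{\varepsilon_O},$$

where $O := \min\{i \in J_{k-1}\} = O(1)$. Note that $J_{k-1} \ni k_0$ is not empty. Since $\hat\delta_{k-1}$
and $\hat\delta_k$ are measures, Lemma 4 applies and shows $E_{\hat\delta_{k-1}}[H^k] \le \ln(1 + \frac{\varepsilon_k}{\varepsilon_O}) \le \frac{\varepsilon_k}{\varepsilon_O}$.
Exploiting $\varepsilon_{k_0}\mu \le \hat\delta_{k-1}$, this implies $E_\mu[H^k] \le \varepsilon_k/(\varepsilon_O\varepsilon_{k_0})$. Lemma 10 then implies
$H^k(\omega) \stackrel{\times}{\le} \frac{\varepsilon_k}{\varepsilon_O\varepsilon_{k_0}} \cdot 2^{K(\mu,H^k,\varepsilon_O\varepsilon_{k_0}/\varepsilon_k) + d_\mu(\omega)}$. Similarly as in (i) we can bound

$K(\mu,H^k,\varepsilon_{k_0}\varepsilon_O/\varepsilon_k) \stackrel{+}{\le} K(J_k|k) + K(k) + K(k_0) \stackrel{+}{\le} k + 2\log k + 2\log k_0$, hence

$H^k(\omega) \stackrel{\times}{\le} \frac{\varepsilon_k}{\varepsilon_{k_0}} \cdot k_0^2 k^2 2^k c_\omega = k_0^8\, 2^{k_0} k^{-4} c_\omega$, where $c_\omega := 2^{d_\mu(\omega)}$.

Chaining this bound via Lemma 9(ii) we get for $k_1 > k_0$:

$$\sum_{t=1}^n h_t(\hat\delta_{k_0},\hat\delta_{k_1}) \le \sum_{t=1}^n 3 \sum_{k=k_0+1}^{k_1} (k - k_0 + 1)^2\, h_t(\hat\delta_{k-1},\hat\delta_k)$$
$$\le 3 \sum_{k=k_0+1}^{k_1} k^2 H^k(\omega) \stackrel{\times}{\le} 3k_0^8\, 2^{k_0} c_\omega \sum_{k=k_0+1}^{k_1} k^{-2} \le 3k_0^7\, 2^{k_0} c_\omega.$$

If we now take $k_1 \to \infty$ we get $\sum_{t=1}^n h_t(\hat\delta_{k_0},\hat{D}) \stackrel{\times}{\le} 3k_0^7\, 2^{k_0 + d_\mu(\omega)}$. Finally let $n \to \infty$.

□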

The main properties allowing for proving $\hat{D} \to \mu$ were that $\hat{D}$ is a measure
with approximations $\hat\delta_k$, which are computable in a certain sense. H is a mixture
over all enumerable/computable measures and hence incomputable.

6 M.L.-Converging Enumerable Semimeasure W

The next step is to enlarge the class of computable measures to an enumerable
class of semimeasures, which are still sufficiently close to measures in order
not to spoil the convergence result. For convergence w.p.1 we could include
all semimeasures (Theorem 3). M.L.-convergence seems to require a more
restricted class. Included non-measures need to be zero on long strings. We convert
semimeasures $\nu$ to "quasimeasures" $\tilde\nu$ as follows:

$\tilde\nu(x_{1:n}) := \nu(x_{1:n})$ if $\sum_{y_{1:n}} \nu(y_{1:n}) > 1 - \frac{1}{n}$, and $\tilde\nu(x_{1:n}) := 0$ else.

If the condition is violated for some $n$ it is also violated for all larger $n$, hence
with $\nu$ also $\tilde\nu$ is a semimeasure. $\tilde\nu$ is enumerable if $\nu$ is enumerable. So if $\nu_1,\nu_2,...$ is
an enumeration of all enumerable semimeasures, then $\tilde\nu_1,\tilde\nu_2,...$ is an enumeration
of all enumerable quasimeasures. The properties important for us are that $\tilde\nu_i \le \nu_i$
and that, if $\nu_i$ is a measure, then $\tilde\nu_i = \nu_i$, else $\tilde\nu_i(x) = 0$ for sufficiently long $x$. We
define the enumerable semimeasure

$W(x) := \sum_{i=1}^\infty \varepsilon_i \tilde\nu_i(x)$, and $D(x) = \sum_{i \in J} \varepsilon_i \tilde\nu_i(x)$ with $J := \{i : \nu_i \text{ is measure}\}$

with $\varepsilon_i = i^{-6}2^{-i}$ as before.
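
The conversion from a semimeasure to a quasimeasure can be sketched in a few lines of Python. This is an illustration under assumptions: the threshold is read as $1 - 1/n$, the semimeasure is taken to be computable (in the text it is only enumerable, so the test is in general only semi-decidable), the empty string is left untouched, and the names `make_quasimeasure`, `leaky` and `uniform` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def make_quasimeasure(nu, alphabet=(0, 1)):
    """Return nu_tilde: equal to nu on length-n strings while the total
    nu-mass of all length-n strings exceeds 1 - 1/n, and 0 otherwise."""
    def nu_tilde(x):
        n = len(x)
        if n == 0:
            return nu(x)  # assumption: the empty string is left untouched
        mass = sum(nu(y) for y in product(alphabet, repeat=n))
        return nu(x) if mass > 1 - Fraction(1, n) else Fraction(0)
    return nu_tilde


def leaky(x):
    """Toy semimeasure that loses 10% of its mass per symbol (hypothetical)."""
    return Fraction(9, 20) ** len(x)


def uniform(x):
    """Uniform measure on binary strings."""
    return Fraction(1, 2 ** len(x))


if __name__ == "__main__":
    q = make_quasimeasure(leaky)
    print([q((0,) * n) for n in range(1, 6)])  # becomes 0 from length 4 on
    u = make_quasimeasure(uniform)
    print(u((0, 1, 1, 0, 1)))  # a measure is left unchanged
```

The brute-force mass computation is exponential in the string length, so the sketch is only meant to make the two properties visible: measures pass through unchanged, while a proper semimeasure is cut to zero on sufficiently long strings.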

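Proposition 12 (Convergence of enumerable W to incomputable D).
For every computable measure $\mu$ and for $\omega$ being $\mu$-random, the following holds
for $t \to \infty$:

(i) $\frac{W(\omega_{1:t})}{D(\omega_{1:t})} \to 1$, (ii) $\frac{W(\omega_t|\omega_{<t})}{D(\omega_t|\omega_{<t})} \to 1$, (iii) $W(a|\omega_{<t}) \to D(a|\omega_{<t})$ $\forall a \in \mathcal{X}$.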
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The intuitive reason for the convergence is that the additional contributions 
of non-measures to W absent in D are zero for long sequences. 

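Proof. (i)

$$D(x) \le W(x) = D(x) + \sum_{i\notin J} \varepsilon_i\tilde\nu_i(x) \le D(x) + \sum_{i=k_x}^{\infty} \varepsilon_i\tilde\nu_i(x) \qquad (8)$$

where $k_x := \min_i\{i\notin J : \tilde\nu_i(x) \ne 0\}$. For $i\notin J$, $\tilde\nu_i$ is not a measure. Hence $\tilde\nu_i(x) = 0$
for sufficiently long $x$. This implies $k_x\to\infty$ for $\ell(x)\to\infty$, hence $W(x) \to D(x)$
$\forall x$. To get convergence in ratio we have to assume that $x = \omega_{1:n}$ with $\omega$ being
$\mu$-random, i.e. $c_\omega := \sup_n \frac{M(\omega_{1:n})}{\mu(\omega_{1:n})} < \infty$.

$$\Rightarrow\quad \tilde\nu_i(x) \le \nu_i(x) \le \frac{1}{w_i} M(x) \le \frac{c_\omega}{w_i}\mu(x) \le \frac{c_\omega}{w_i\varepsilon_{k_0}} D(x).$$

The last inequality holds, since $\mu$ is a computable measure of index $k_0$, i.e.
$\mu = \nu_{k_0} = \tilde\nu_{k_0}$. Inserting $1/w_i \le c\cdot i^2$ for some $c = O(1)$ and $\varepsilon_i$ we get $\varepsilon_i\tilde\nu_i(x) \le
\frac{c\,c_\omega}{\varepsilon_{k_0}} i^{-4}2^{-i} D(x)$, which implies $\sum_{i=k_x}^{\infty}\varepsilon_i\tilde\nu_i(x) \le \varepsilon'_x D(x)$ with $\varepsilon'_x \to 0$
for $\ell(x)\to\infty$ (because $k_x\to\infty$). Inserting this into (8) we get

$$1 \le \frac{W(x)}{D(x)} \le 1 + \varepsilon'_x \ \xrightarrow{\ \ell(x)\to\infty\ }\ 1 \quad\text{for $\mu$-random $x$.}$$

(ii) Obvious from (i) by taking a double ratio.

(iii) Let $a\in\mathcal X$. From $W(xa) \ge D(xa)$ ($W \ge D$) and $W(x) \le (1+\varepsilon'_x)D(x)$ (i)
we get

$$W(a|x) \ge (1+\varepsilon'_x)^{-1} D(a|x) \ge (1-\varepsilon'_x) D(a|x) \quad \forall a\in\mathcal X, \text{ and}$$
$$1 - W(a|x) \ge \sum_{b\ne a} W(b|x) \ge (1-\varepsilon'_x)\sum_{b\ne a} D(b|x) = (1-\varepsilon'_x)(1 - D(a|x)),$$

where we used in the second line that W is a semimeasure and D proportional
to a measure. Together this implies $|W(a|x) - D(a|x)| \le \varepsilon'_x$. Since $\varepsilon'_x \to 0$ for
$\mu$-random $x$, this shows (iii). $h_x(W,D) \le \varepsilon'_x$ can also be shown. □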



Speed of convergence. The main convergence Theorem 8 now immediately
follows from Propositions 11 and 12. We briefly remark on the convergence rate.
Lemma 4 shows that $\mathbf E[\sum_t h_t(X,\mu)]$ is logarithmic in the index $k_0$ of $\mu$ for $X = M$
($\ln w_{k_0}^{-1} \sim \ln k_0$), but linear for $X \in \{W, D, \delta_{k_0}\}$ ($\ln \varepsilon_{k_0}^{-1} \sim k_0$). The individual bounds
for $\sum_t h_t(\delta_{k_0},\mu)$ and $\sum_t h_t(\delta_{k_0},\hat D)$ in Proposition 11 are linear and exponential
in $k_0$, respectively. For $W \to D$ (M.L.) we could not establish any convergence speed.

Finally we show that W does not dominate all enumerable semimeasures,
as the definition of W suggests. We summarize all computability, measure, and
dominance properties of M, D, $\hat D$, and W in the following theorem:

Theorem 13 (Properties of M, W, D, and $\hat D$).

(i) M is an enumerable semimeasure, which dominates all enumerable
semimeasures. M is not computable and not a measure.

(ii) $\hat D$ is a measure, D is proportional to a measure, both dominating all
enumerable quasimeasures. D and $\hat D$ are not computable and do not dominate all
enumerable semimeasures.

(iii) W is an enumerable semimeasure, which dominates all enumerable
quasimeasures. W is not itself a quasimeasure, is not computable, and does not
dominate all enumerable semimeasures.

We conjecture that D and $\hat D$ are not even approximable (limit-computable),
but lie somewhere higher in the arithmetic hierarchy. Since W can be normalized
to an approximable measure M.L.-converging to $\mu$, and D was only an intermediate
quantity, the question of approximability of D seems not too interesting.



7 Conclusions 

We investigated a natural strengthening of Solomonoff's famous convergence
theorem, the latter stating that with probability 1 (w.p.1) the posterior of a
universal semimeasure M converges to the true computable distribution $\mu$ ($M \xrightarrow{w.p.1} \mu$).
We answered partially negative the question of whether convergence also holds
individually for all Martin-Löf (M.L.) random sequences ($\exists M : M \not\xrightarrow{M.L.} \mu$). We
constructed random sequences $\alpha$ for which there exist universal semimeasures
on which convergence fails. Multiplicative dominance of M is the key property
to show convergence w.p.1. Dominance over all measures is also satisfied by the
restricted mixture W over all quasimeasures. We showed that W converges to $\mu$
on all M.L.-random sequences by exploiting the incomputable mixture D over all
measures. For $D \xrightarrow{M.L.} \mu$ we achieved a (weak) convergence rate; for $W \xrightarrow{M.L.} D$ and
$W/D \xrightarrow{M.L.} 1$ only an asymptotic result. The convergence rate properties w.p.1 of
D and W are as excellent as for M.

We do not know whether convergence in ratio ($\ldots \to 1$) holds. We also don't know the
convergence rate for $W \xrightarrow{M.L.} D$, and the current bound for $D \xrightarrow{M.L.} \mu$ is double
exponentially worse than for $M \xrightarrow{w.p.1} \mu$. A minor question is whether D is
approximable (which is unlikely). Finally there could still exist universal semimeasures
M (dominating all enumerable semimeasures) for which M.L.-convergence holds
($\exists M : M \xrightarrow{M.L.} \mu$ ?). In case they exist, we expect them to have particularly
interesting additional structure and properties. While most results in algorithmic
information theory are independent of the choice of the underlying universal
Turing machine (UTM) or universal semimeasure (USM), there are also results
which depend on this choice. For instance, one can show that $\{(x,n) : K_U(x) < n\}$
is tt-complete for some U, but not tt-complete for others [MP02]. A potential U
dependence also occurs for predictions based on monotone complexity [Hut03d].
It could lead to interesting insights to identify a class of "natural" UTMs/USMs
which have a variety of favorable properties. A more moderate approach may
be to consider classes $\mathcal C_i$ of UTMs/USMs satisfying certain properties $\mathcal P_i$ and
showing that the intersection $\bigcap_i \mathcal C_i$ is not empty.

Another interesting and potentially fruitful approach to the convergence
problem at hand is to consider other classes of semimeasures $\mathcal M$, define
mixtures M over $\mathcal M$, and (possibly) generalized randomness concepts by using this
M in Definition 5. Using this approach, in [Hut03b] it has been shown that
convergence holds for a subclass of Bernoulli distributions if the class is dense, but
fails if the class is gappy, showing that a denseness characterization of $\mathcal M$ could
be promising in general.

Acknowledgements. We want to thank Alexey Chernov for his invaluable
help.

Abstract. It is well known that there exists a universal (i.e., optimal to
within an additive constant if allowed to work infinitely long) algorithm
for lossless data compression (Kolmogorov, Levin). The game of lossless
compression is an example of an on-line prediction game; for some other
on-line prediction games (such as the simple prediction game) a universal
algorithm is known not to exist. In this paper we give an analytic
characterisation of those binary on-line prediction games for which a universal
prediction algorithm exists.



1 Introduction 

We consider the following on-line prediction protocol: the prediction algorithm
is fed with a sequence of bits $\omega_1, \omega_2, \ldots$, and its goal is to predict each $\omega_t$, $t =
1,2,\ldots$, given the previous bits $\omega_1, \ldots, \omega_{t-1}$. The loss suffered by the algorithm
at each trial is measured by a loss function $\lambda$. We are interested in universal
prediction algorithms for this problem.

The first universal algorithm was found back in 1965 by Kolmogorov for
a batch version of this problem. Kolmogorov found a universal method of
compression: for each string one can generate shorter and shorter compressed strings
so that optimal compression is achieved in the limit; the original string can be
restored from each of these compressed strings by a computable function. The
length of the shortest compressed string is known as the (plain) Kolmogorov
complexity of the original string.

Slightly later Levin (in [1]) put Kolmogorov’s ideas in an on-line framework 
constructing a universal algorithm for on-line data compression; the version of 
Kolmogorov complexity given by Levin’s construction is denoted KM (the log 
of the a priori semimeasure). There are also several ‘semi-on-line’ versions of 
Kolmogorov complexity (such as prefix and monotone complexity). 

Much later (see [2]) it was shown that a universal algorithm exists for a
wide class of on-line prediction games including, besides some traditional games
of prediction, the data compression game and games of playing the market;
the loss suffered by the universal algorithm on a string is called the predictive
complexity of that string. The technical basis for this result is provided by the
theory of prediction with expert advice (see [3,4,5]).

* A version of this paper with more details is available as Technical Report
CLRC-TR-04-04 (revised), Computer Learning Research Centre, Royal Holloway, University of
London; see http://www.clrc.rhul.ac.uk/publications/techrep.htm

The first on-line prediction games for which a universal prediction algorithm 
(equivalently, predictive complexity) does not exist appear in [6,7]. They in- 
clude such important games as the simple prediction game and the absolute-loss 
game. The main result of this paper, Theor. 1, is an analytic characterisation of 
those on-line prediction games for which a universal prediction algorithm exists. 
The theorem states that a universal algorithm exists if and only if the game is 
mixable. 

The proof relies on an analytical characterisation of mixable games provided 
in Sect. 4. It generalises the criterion from [8] and may be used for checking 
whether a game is mixable. 



2 Preliminaries 

2.1 Games and Superpredictions 

An (on-line prediction) game $\mathfrak G$ is a triple $(\mathbb B, \Gamma, \lambda)$, where $\mathbb B = \{0,1\}$ is the
outcome space¹, $\Gamma$ is the prediction space, and $\lambda : \mathbb B\times\Gamma \to \mathbb R\cup\{+\infty\}$ is the
loss function.

It is essential to allow $\lambda$ to take the value $+\infty$; however we do not allow $-\infty$
since the sum $+\infty + (-\infty)$ is undefined. We also prohibit the values of $\lambda$ to
approach $-\infty$: we require that there is some $a\in\mathbb R$ such that $\lambda(\omega,\gamma) \ge a$ for all
$\omega\in\mathbb B$ and $\gamma\in\Gamma$. Having permitted $+\infty$, we need to ensure that at least some
values of $\lambda$ are finite, otherwise the game is degenerate. We require that there is
$\gamma_0\in\Gamma$ such that both $\lambda(0,\gamma_0)$ and $\lambda(1,\gamma_0)$ are finite numbers. In fact, our main
result remains true for the games that do not satisfy these requirements, but we
do not cover such games in this paper.

We need to impose computability requirements on the games. Assume that 
with every game there comes an oracle that is able to answer natural questions 
about the game. All computations involving the game can include calls to the 
oracle. The exact requirements to the oracle are given in Sect. 5. 

We will denote elements of $\mathbb B^*$ (i.e., finite strings of elements of $\mathbb B$) by bold
letters, e.g., x, y. The length (i.e., the number of elements) of a string x is
denoted by $|x|$; the number of ones in x and the number of zeroes in x are
denoted by $\sharp_1 x$ and $\sharp_0 x$, respectively.

Let us describe the intuition behind the concept of a game. Consider a
prediction algorithm $\mathfrak A$ working according to the following protocol:

for t = 1, 2, ...
  (1) $\mathfrak A$ chooses a prediction $\gamma_t\in\Gamma$
  (2) $\mathfrak A$ observes the actual outcome $\omega_t\in\mathbb B$
  (3) $\mathfrak A$ suffers loss $\lambda(\omega_t,\gamma_t)$
end for

¹ In this paper we restrict ourselves to games with the outcome space $\mathbb B$. These games
are sometimes called 'binary'. A more general definition is possible.

The algorithm $\mathfrak A$ suffers the total loss $\mathrm{Loss}_{\mathfrak A}(\omega_1\omega_2\ldots\omega_T) = \sum_{t=1}^{T}\lambda(\omega_t,\gamma_t)$
over the first T trials. By definition, put $\mathrm{Loss}_{\mathfrak A}(\Lambda) = 0$, where $\Lambda$ denotes the
empty string.
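The protocol and the total loss can be sketched as a short Python loop. This is a minimal illustration, not code from the paper: a strategy is modelled as a function of the list of past outcomes, and the names `total_loss`, `absolute_loss` and `always_zero` are hypothetical.

```python
def total_loss(strategy, loss, outcomes):
    """Run the on-line protocol and return the cumulative loss of `strategy`."""
    history, total = [], 0.0
    for omega in outcomes:
        gamma = strategy(history)    # (1) choose a prediction from past outcomes
        total += loss(omega, gamma)  # (2)-(3) observe the outcome, suffer the loss
        history.append(omega)
    return total

# Absolute-loss game: predictions in [0, 1], loss |omega - gamma|.
absolute_loss = lambda omega, gamma: abs(omega - gamma)
# A strategy that ignores the past and always predicts 0.
always_zero = lambda history: 0.0

print(total_loss(always_zero, absolute_loss, [1, 0, 1, 1]))  # counts the ones
```

With an empty outcome sequence the loop body never runs and the function returns 0, matching the convention for the empty string.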

The function $\mathrm{Loss}_{\mathfrak A}(x)$ can be treated as the predictive complexity of x in
the game $\mathfrak G$ w.r.t. $\mathfrak A$. We will call these functions loss processes. Unfortunately,
the set of all loss processes for a given game has no minimal elements except in
some degenerate cases. The set of loss processes should be extended to the set
of superloss processes.

2.2 Superloss Processes and Predictive Complexity 

Take a game $\mathfrak G = (\mathbb B,\Gamma,\lambda)$. We say that a pair $(s_0,s_1)\in(-\infty,+\infty]^2$ is a
superprediction if there is $\gamma\in\Gamma$ such that the inequalities $\lambda(0,\gamma) \le s_0$ and
$\lambda(1,\gamma) \le s_1$ hold. The set of superpredictions is denoted by S and is an important
object characterising the game.

The requirements we have imposed on functions $\lambda$ can be reformulated in
terms of S as follows. The set $S\cap\mathbb R^2$, the finite part of S, is not empty and
there is $a\in\mathbb R$ such that $S\subseteq[a,+\infty]^2$.

A function $L : \mathbb B^* \to \mathbb R\cup\{+\infty\}$ is called a superloss process w.r.t. $\mathfrak G$ (see [2])
if the following conditions hold: (1) $L(\Lambda) = 0$, (2) for every $x\in\mathbb B^*$, the pair
$(L(x0) - L(x), L(x1) - L(x))$ is a superprediction w.r.t. $\mathfrak G$, and (3) L is
semicomputable from above (w.r.t. the oracle of the game).

We will say that a superloss process K is universal if for any superloss process
L there exists a constant C such that $K(x) \le L(x) + C$ for all $x\in\mathbb B^*$. The
difference between two universal superloss processes w.r.t. $\mathfrak G$ is bounded by a
constant. If universal superloss processes w.r.t. $\mathfrak G$ exist, we may pick one and
denote it by $\mathcal K^{\mathfrak G}$. It follows from the definition that, for every prediction algorithm
$\mathfrak A$, there is a constant C such that for every x we have $\mathcal K^{\mathfrak G}(x) \le \mathrm{Loss}_{\mathfrak A}(x) + C$.
We call $\mathcal K^{\mathfrak G}$ (predictive) complexity w.r.t. $\mathfrak G$. We will usually omit the superscript
and write just $\mathcal K$.

2.3 Mixability 

Mixability was introduced in [5,2]. Take a parameter $\beta\in(0,1)$ and consider
the homeomorphism $\mathfrak B_\beta : (-\infty,+\infty]^2 \to [0,+\infty)^2$ specified by the formula
$\mathfrak B_\beta(x,y) = (\beta^x,\beta^y)$. A game $\mathfrak G$ with the set of superpredictions S is called
$\beta$-mixable if the set $\mathfrak B_\beta(S)$ is convex. A game $\mathfrak G$ is mixable if it is $\beta$-mixable for
some $\beta\in(0,1)$.

Examples of mixable games are the logarithmic game with $\Gamma = [0,1]$ and $\lambda$
given by the formulae $\lambda(0,\gamma) = -\log_2(1-\gamma)$ and $\lambda(1,\gamma) = -\log_2\gamma$ and also
the square-loss game with $\Gamma = [0,1]$ and $\lambda(\omega,\gamma) = (\omega-\gamma)^2$. They specify the
logarithmic complexity $\mathcal K^{\log}$ and the square-loss complexity $\mathcal K^{\mathrm{sq}}$ respectively (see
[2]). Logarithmic complexity coincides with the negative logarithm of Levin's a
priori semimeasure (see [9] for a definition). The negative logarithm of Levin's
a priori semimeasure is a variant of Kolmogorov complexity. Thus we may say
that Kolmogorov complexity is a special case of predictive complexity.

Fig. 1. The set of superpredictions S for the absolute-loss game.

Natural examples of games that are not mixable (non-mixable) are provided
by the absolute-loss game with $\Gamma = [0,1]$ and $\lambda(\omega,\gamma) = |\omega-\gamma|$ (the set of
superpredictions for the absolute-loss game is shown in Fig. 1) and the simple
prediction game with $\Gamma = \mathbb B$ and $\lambda(\omega,\gamma) = 0$ if and only if $\omega = \gamma$.

It is easy to prove the mixability or non-mixability of the above games using
Theor. 2 below.

3 The Main Result and the Idea of the Proof 

Theorem 1. Predictive complexity w.r.t. a game $\mathfrak G$ exists if and only if $\mathfrak G$ is
mixable.

It is assumed in the theorem that the oracle for the game $\mathfrak G$ satisfies the
requirements from Sect. 5. We include those to make the statement of the theorem
precise.

The 'if' part of the theorem is proved in [2]. The 'only if' part is proved in
Appendix A.

In order to illustrate the idea of the proof, we consider a special case. Let us
show that there is no predictive complexity for the absolute-loss game defined
in Subsect. 2.3. The set of superpredictions for the absolute-loss game is shown
in Fig. 1. The proof is by reductio ad absurdum. Suppose that $\mathcal K$ is predictive
complexity w.r.t. the absolute-loss game.

The set S of superpredictions for the absolute-loss game lies above the
straight line $x/2 + y/2 = 1/2$. It is easy to see that

$$\mathbf E\,\mathcal K(\xi_1^{(1/2)}\ldots\xi_n^{(1/2)}) \ge n/2 \qquad (1)$$

for all positive integers n, where $\xi_1^{(1/2)},\ldots,\xi_n^{(1/2)}$ are results of n
independent Bernoulli trials with the probability of success equal to 1/2. Indeed, the
mean 'loss' of $\mathcal K$ on each 'trial' is at least 1/2 (see Lemma 1).

Consider the prediction strategy that outputs 0 no matter what outcomes
actually occur. It suffers loss $L(x) = \sharp_1 x$ for all strings x. It follows from the
definition of predictive complexity that there is a constant $C_1$ such that $\mathcal K(x) \le
L(x) + C_1 = \sharp_1 x + C_1$ for all strings x. Similar considerations imply that $\mathcal K(x) \le
\sharp_0 x + C_0$ for some $C_0 > 0$. Letting $C = \max(C_0, C_1) > 0$ yields

$$\mathcal K(x) \le \min(\sharp_0 x, \sharp_1 x) + C \le \frac{|x|}{2} + C . \qquad (2)$$

Let $\Xi\subseteq\mathbb B^n$ be the set of strings having $n/2 + \sqrt n$ or more 0s. We will use the
upper bound $\mathcal K(x) \le L(x) + C \le n/2 - \sqrt n + C$ on $\mathcal K$ for $x\in\Xi$ and the upper
bound (2) for all other values of x. It follows from the DeMoivre-Laplace limit
theorem (see, e.g., [10]) that $\Pr(\Xi)$ tends to a positive constant $\delta$ as n tends to
$\infty$, where $\Pr(\Xi)$ is the probability of the event $\xi_1^{(1/2)}\ldots\xi_n^{(1/2)}\in\Xi$.

Thus

$$\mathbf E\,\mathcal K(\xi_1^{(1/2)}\ldots\xi_n^{(1/2)}) \le \Pr(\Xi)\left(\frac n2 - \sqrt n + C\right) + (1-\Pr(\Xi))\left(\frac n2 + C\right) = \frac n2 - \sqrt n\,(\delta + o(1)) + C$$

as $n\to+\infty$. This contradicts (1).
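The size of the gap that produces the contradiction can be checked numerically. The sketch below uses a slightly simpler route than the argument in the text: it evaluates the expectation of min(number of zeroes, number of ones) exactly for fair coin flips, which by the bound (2) dominates the expected complexity up to the constant C. The function names `expected_min_count` and `gap` are hypothetical.

```python
from math import comb, pi, sqrt

def expected_min_count(n):
    """E[min(#zeroes, #ones)] over n fair coin flips, computed exactly."""
    return sum(comb(n, k) * min(k, n - k) for k in range(n + 1)) / 2 ** n

def gap(n):
    """How far the expected minimum falls below n/2."""
    return n / 2 - expected_min_count(n)

for n in (100, 400, 1600):
    # The gap grows on the order of sqrt(n); sqrt(n / (2*pi)) is shown for scale.
    print(n, gap(n), sqrt(n / (2 * pi)))
```

Since the gap is unbounded in n while C is a fixed constant, the expected complexity would eventually drop below n/2, which is the contradiction with (1).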

In the general case we approximate the boundary of the set of superpredic- 
tions S with line segments and perform a similar trick. The approximation takes 
place near a point where mixability gets violated. In order to find such a point, 
we need the concept of a canonical specification and a criterion of mixability 
formulated in Sect. 4. 



4 Canonical Specifications 

The proof of the main theorem is based on the concept of canonical specification. 
In this section we introduce canonical specifications and establish a criterion of 
mixability in terms of canonical specifications. The criterion is of independent 
interest because it reduces the question of whether a game is mixable to a simple 
analytical problem. 



4.1 Definitions 

Let $\mathfrak G$ be a game with the set of superpredictions S. Consider the function
$f : \mathbb R \to (-\infty,+\infty]$ defined by $f(x) = \inf\{y\in\mathbb R \mid (x,y)\in S\}$ for each $x\in\mathbb R$
(here we assume that $\inf\emptyset = +\infty$).

Let $a = \sup\{x\in\mathbb R \mid f(x) = +\infty\}$; since by our assumption S has a
non-empty finite part, $a < +\infty$. Let $b = \inf\{x\in\mathbb R \mid f \text{ is constant on } [x,+\infty)\}$,
where we again assume that $\inf\emptyset = +\infty$. It is possible that $b = +\infty$. Clearly,
$b \ge a$. If $b > a$, let f be the restriction of f to $(a,b)$. We will call such $f : (a,b) \to
\mathbb R$ the canonical specification of the game $\mathfrak G$.

If $a = b$, then the closure of $S\cap\mathbb R^2$ coincides with the set $[a,+\infty]\times[c,+\infty]$
for some $c\in\mathbb R$. It is easy to see that a game with the set of superpredictions
$S = [a,+\infty]\times[c,+\infty]$ is mixable and if the requirements to the oracle from
Sect. 5 hold, then there is predictive complexity w.r.t. this game.

Note that if S is the closure of its finite part, then the canonical specification
$f : (a,b) \to \mathbb R$ uniquely determines S. Indeed, in this case S is the closure w.r.t.
the extended² topology of the set $\{(x,y)\in[-\infty,+\infty]^2 \mid \exists u\in(a,b) : x \ge
u \text{ and } y \ge f(u)\}$.

For example, the canonical specification for the absolute-loss game defined
in Subsect. 2.3 is $f(x) = 1 - x$, $x\in(0,1)$.

Corollary 1 from [8] provides a criterion of mixability for the case of twice
differentiable canonical specifications. Namely, it states that the game is mixable
if and only if the fraction $f''(x)/(f'(x)(f'(x)-1))$ is positive and separated from 0. We
are going to generalise this criterion to a more general case where the existence
of $f'$ and $f''$ is not assumed.



4.2 Derivatives 



Let $f : (a,b) \to \mathbb R$ be the canonical specification of a game $\mathfrak G$ with the set of
superpredictions S.

We need to define the following derivatives that make sense even if f is not
smooth. First let us define the left and right derivatives:

$$f'_l(x) = \lim_{\Delta x\to 0-}\frac{f(x+\Delta x) - f(x)}{\Delta x}, \qquad f'_r(x) = \lim_{\Delta x\to 0+}\frac{f(x+\Delta x) - f(x)}{\Delta x}.$$

If $f'_l(x) = f'_r(x)$ at some point x, we say that the derivative $f'(x) = f'_l(x) = f'_r(x)$
exists.

Suppose the derivative $f'(x)$ exists at $x\in(a,b)$. Let

$$f''_{\leftarrow}(x) = \liminf_{\Delta x\to 0-}\frac{f(x+\Delta x) - (f(x) + f'(x)\Delta x)}{\frac12(\Delta x)^2}, \qquad (3)$$

$$f''_{\rightarrow}(x) = \liminf_{\Delta x\to 0+}\frac{f(x+\Delta x) - (f(x) + f'(x)\Delta x)}{\frac12(\Delta x)^2}. \qquad (4)$$

By definition, lower limits always exist though their values may reach $-\infty$. Let
$f''(x) = \min(f''_{\leftarrow}(x), f''_{\rightarrow}(x))$.

If $S\cap\mathbb R^2$, the finite part of S, is convex, then f is convex. Since convex
functions always have one-sided derivatives, $f'_l(x)$ and $f'_r(x)$ exist for each $x\in(a,b)$.
Moreover they coincide everywhere except on a countable subset of the interval
$(a,b)$ (see, e.g., [11], Theor. 11. C). The derivatives $f''_{\leftarrow}$ and $f''_{\rightarrow}$ are nonnegative
for convex functions f.

² The extended topology on the set $[-\infty,+\infty]$ is generated by the Euclidean topology
of $\mathbb R$ and all sets $(a,+\infty]$ and $[-\infty,a)$, $a\in\mathbb R$. The extended topology on the set
$[-\infty,+\infty]^2$ is generated by the Cartesian product of extended topologies.

4.3 An Analytical Characterisation of Mixable Games

In this subsection we formulate some results characterising mixability in terms
of canonical specifications.

Theorem 2. Let $\mathfrak G$ be a game with the set of superpredictions S and a canonical
specification $f : (a,b) \to \mathbb R$. Then $\mathfrak G$ is mixable if and only if the following
conditions hold: $S = \overline{S\cap\mathbb R^2}$ w.r.t. the extended topology, f is convex, and

$$\inf_{x\in(a,b)\setminus D}\frac{f''(x)}{f'(x)(f'(x)-1)} > 0 \qquad (5)$$

where D is the set of points $y\in(a,b)$ such that $f'_l(y) \ne f'_r(y)$.

This theorem follows from the next theorem:

Theorem 3. Let $\mathfrak G$ be a game with the set of superpredictions S and the
canonical specification $f : (a,b) \to \mathbb R$; let $\beta\in(0,1)$. Let S be the closure of its finite
part, $S = \overline{S\cap\mathbb R^2}$. Then $\mathfrak G$ is $\beta$-mixable if and only if the following conditions
hold: f is convex and

$$f''(x) \ge -f'(x)(f'(x)-1)\ln\beta \qquad (6)$$

for all $x\in(a,b)\setminus D$, where D is the set of points $y\in(a,b)$ such that $f'_l(y) \ne f'_r(y)$.

The theorem can be proved similarly to Corollary 1 from [8]. Note that the
computability requirements from Sect. 5 are not used in the proofs of the two
theorems from this subsection.
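For smooth canonical specifications the criterion can be explored numerically. The sketch below estimates the fraction f''(x) / (f'(x)(f'(x) - 1)) by central finite differences for three games from Subsect. 2.3. It is an illustration only: the canonical specifications for the square-loss and logarithmic games are derived here from their loss functions (they are not quoted from the text), the exceptional set D is empty for these smooth examples, and the names `ratio` and `inf_ratio` are hypothetical.

```python
from math import log, log2, sqrt

def ratio(f, x, h=1e-4):
    """Finite-difference estimate of f''(x) / (f'(x) * (f'(x) - 1))."""
    d1 = (f(x + h) - f(x - h)) / (2 * h)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2
    return d2 / (d1 * (d1 - 1))

def inf_ratio(f, lo, hi, steps=999):
    """Minimum of the ratio over an evenly spaced interior grid of (lo, hi)."""
    return min(ratio(f, lo + (hi - lo) * i / (steps + 1))
               for i in range(1, steps + 1))

absolute = lambda x: 1 - x                   # absolute loss, on (0, 1)
square = lambda x: (1 - sqrt(x)) ** 2        # square loss, on (0, 1)
logarithmic = lambda x: -log2(1 - 2 ** -x)   # logarithmic loss, on (0, +inf)

print(inf_ratio(absolute, 0, 1))             # about 0: criterion fails
print(inf_ratio(square, 0, 1))               # about 2: bounded away from 0
print(ratio(logarithmic, 1.0), log(2))       # constant ratio, about ln 2
```

Under inequality (6), a positive infimum r of the ratio corresponds to beta-mixability for every beta with ln(1/beta) at most r, so the grid values above are consistent with the square-loss and logarithmic games being mixable and the absolute-loss game not being mixable.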

5 Computability 

In this section we discuss the requirements the oracle for a game $\mathfrak G$ with the
set of superpredictions S should satisfy. The general principle is that the oracle
should be able to answer all 'reasonable' questions about the game; we do not
want, however, to formalise the notion of 'reasonable' and just present a rather
long list of questions that the oracle should be able to answer for our proofs to
work. All those questions can be easily answered for games specified in a natural
analytical form, e.g., for all games defined in Subsect. 2.3. Thus the requirements
are not particularly restrictive as far as applications are concerned.

A rational interval is either an interval $(p,q)$, where p and q are rational
numbers or $+\infty$ or $-\infty$, one of the intervals $[-\infty,p)$, $(p,+\infty]$, where p is a
rational number, or $[-\infty,+\infty]$. A rational interval is open w.r.t. the extended
topology of $[-\infty,+\infty]$. We say that a sequence of nested intervals specifies some
number $x\in[-\infty,+\infty]$ if it shrinks to x, i.e., the intersection of all intervals
from the sequence is the one-point set $\{x\}$.

In order to prove the existence of complexity for mixable games, we require
that the interior of the set of superpredictions S is constructively open w.r.t.
the oracle in the sense of the following definition. Suppose that the oracle is
(simultaneously) getting two sequences of rational intervals shrinking to x and
y. If the point $(x,y)$ belongs to the interior of S, the oracle should be able to tell
it eventually. This requirement is used in the proof of the 'if' part of the main
theorem.

Now suppose that the game has a canonical specification $f : (a,b) \to \mathbb R$. The
oracle should be able to tell whether

$$\inf_{x\in(a,b)\setminus D}\frac{f''(x)}{f'(x)(f'(x)-1)} = 0$$

and if so it must produce a sequence of points $x_m\in\mathbb R$, $m = 1,2,\ldots$ (by saying
that the oracle produces the sequence we mean that given m and l the oracle
produces a rational interval with endpoints $p_{m,l}$ and $q_{m,l}$ so that for each m
the intervals with endpoints $p_{m,l}$ and $q_{m,l}$, $l = 1,2,\ldots$, shrink to $x_m$) with the
following properties: the values $f'(x_m)$ exist for all positive integers m;

$$\frac{f''(x_m)}{f'(x_m)(f'(x_m)-1)} \to 0 ;$$

$x_m$ converges to $x_0\in[a,b]$; the sequence $x_m$ is monotone; the values $f(x_m)$,
$f'(x_m)$, $f''_{\leftarrow}(x_m)$, and $f''_{\rightarrow}(x_m)$ can be produced by the oracle; for each m the
oracle can produce a sequence $\Delta x_{m,k}$, $k = 1,2,\ldots$, such that: $\Delta x_{m,k} \to 0$ as
$k\to+\infty$ for all positive integers m; for each positive integer m the values $\Delta x_{m,k}$
are of the same sign; the values $f(x_m + \Delta x_{m,k})$ can be produced by the oracle
for all positive integers m and k; for every positive integer m the value of $f''(x_m)$
is achieved on the sequence $\Delta x_{m,k}$, $k = 1,2,\ldots$, i.e.,

$$f''(x_m) = \lim_{k\to+\infty}\frac{f(x_m+\Delta x_{m,k}) - (f(x_m) + f'(x_m)\Delta x_{m,k})}{\frac12(\Delta x_{m,k})^2} .$$

Appendix A: Proof of the Main Result

In this section we prove the 'only if' part of Theor. 1, which is the main result
of this paper. The proof is divided into several stages.

Assume the converse. Let $\mathfrak G$ be a non-mixable game with the set of
superpredictions S and $\mathcal K$ be predictive complexity w.r.t. $\mathfrak G$.

Let $\mathfrak G_1$ and $\mathfrak G_2$ be two games with sets of superpredictions $S_1$ and $S_2$. If $S_1$
is the closure of $S_2$ w.r.t. the extended topology, then $\mathfrak G_1$ and $\mathfrak G_2$ are mixable
simultaneously and predictive complexity w.r.t. $\mathfrak G_1$ exists if and only if
predictive complexity w.r.t. $\mathfrak G_2$ exists. It is easy to check that if S is not a subset
of the closure of its finite part $S\cap\mathbb R^2$, the game is not mixable and does not
specify predictive complexity. It is readily seen that a game with the set of
superpredictions S is mixable if and only if a game with the set of superpredictions
$S + (a,b) = \{(x+a, y+b)\in(-\infty,+\infty]^2 \mid (x,y)\in S\}$ is mixable and the same
applies to the existence of predictive complexity. Finally, Lemma 7 from [6] states
that if $S\cap\mathbb R^2$ is not convex, the game does not specify predictive complexity.
Therefore we assume for the rest of the proof that S is the closure of the convex
set $S\cap\mathbb R^2$ and $S\subseteq[0,+\infty]^2$.

Introductory stage. The case $S = [a,+\infty]\times[c,+\infty]$ is trivial (see
Subsect. 4.1). Let $f : (a,b) \to \mathbb R$ be the canonical specification of $\mathfrak G$. The function
f is convex. It follows from Theor. 2 that

$$\inf_{x\in(a,b)\setminus D}\frac{f''(x)}{f'(x)(f'(x)-1)} = 0 ,$$

where $D\subseteq(a,b)$ is the set of points where $f'(x)$ does not exist. The requirements
from Sect. 5 state that the oracle can produce a sequence $x_n\in(a,b)$, $n =
1,2,\ldots$, such that the derivative $f'(x_n)$ exists for each n and

$$\frac{f''(x_n)}{f'(x_n)(f'(x_n)-1)} \to 0 . \qquad (10)$$

It can be assumed that this sequence is monotone and converges to $x_0\in[a,b]$
(it is possible that $x_0 = b = +\infty$).

For each $m = 1,2,\ldots$ there is a sequence $\Delta x_{m,k}$, $k = 1,2,\ldots$, such that
$\Delta x_{m,k} \to 0$ as $k\to+\infty$ and

$$f(x_m + \Delta x_{m,k}) = f(x_m) + f'(x_m)\Delta x_{m,k} + f''(x_m)\frac{\Delta x_{m,k}^2}{2} + r_m(\Delta x_{m,k})\,\Delta x_{m,k}^2 \qquad (11)$$

where for each $m = 1,2,\ldots$ we have $r_m(t) \to 0$ as $t\to 0$.
For each positive integer m let

$$p_m = 1/(1 - f'(x_m)) . \qquad (12)$$

Since $f'(x_m) < 0$, we have $p_m\in(0,1)$. It is easy to check that the line $(1-p_m)x + p_m y = (1-p_m)x_m + p_m f(x_m)$ is a tangent line to S at $(x_m, f(x_m))$. A
simple calculation yields $f'(x_m) = -(1-p_m)/p_m$.

Single point processes. We need to construct some superloss processes.
To make the descriptions more intuitive, we will use the term 'strategy' in the
following general sense. Take a superloss process L. We say that L is the loss of
a strategy $\mathfrak A$ that outputs the prediction $(L(x0) - L(x), L(x1) - L(x))$ on input
x for all sequences $x\in\mathbb B^*$.

Let $\mathfrak A_m$ be the strategy that always predicts $(x_m, f(x_m))$, $m = 1,2,\ldots$.
We denote the loss of $\mathfrak A_m$ on a string $x\in\mathbb B^*$ by $L_m(x)$. Now we are going to
construct sequences of positive numbers $\varepsilon_{m,k}$, $m,k = 1,2,\ldots$ such that for every
m we have $\varepsilon_{m,k} \to 0$ as $k\to+\infty$, and strategies $\mathfrak A_m^{\varepsilon_{m,k}}$.

The requirements from Sect. 5 imply that we can assume that all Axm.k are 
of the same sign. There are two possibilities for the sign. If they are all negative, 
we let Em.k = —Axm,k and if they are all positive, we let £m,k = Axm,k for all 
positive integers m and k. These two possibilities are illustrated by Figures 2 
and 3, where if™ = (1 - Pm)xm + Pmf{xm) so that (1 - Pm)x + PmV = Em is 
the tangent line to S at {xm, f{xm))- 

Let , ^ 2 ^^ , ■ . . , be the outcomes of Bernoulli trials with the probability 
of success equal to p and let 







(13) 



We will now show that there are functions am{t) such that for every positive 
integer m we have am{t) — >■ 0 as t — >■ 0 and the inequality 




A Criterion for the Existence of Predictive Complexity for Binary Games 










(•!! / \^m,k , / i'rn ptt / \ ~ rn.K / /i \ 

-./ {Xm)^:—npm+£m,k\ n ,/ (Xm) V ~ Pm) 

- 2 \j Pm - 2 



1 _ 

^ Pm rff / \‘^m,k 



H“ ^m {pm,k )£m,k {y npmil - Pm) + npm^ , (14) 



holds with probability at least ^(—1) — 1/ \Jnpm{^ — Pm)- Let us consider the 
two possibilities mentioned above separately. 

The case of negative increments. First suppose that Axm,k are all neg- 
ative and £m,k = —Axm,k- Let 21^ be the strategy that always outputs the 
prediction {xm — £, f{xm ~ £))> £ G (0, Xm)- We denote the loss of 21^ on £c G B* 
by Lm{x). See Fig. 2 for an illustration of this case. 

Take a positive integer m, a string x of length n and £ G (0,Xm — o). It 
is easy to see that if x contains zeroes or more, then L^{x) — Lm{x) > 
-{f{xm-e)- f{xm)){n-l‘'^'>). Let = n(l-pm) + \/npm(l - Pm)- 

Taking into account (11), for each x having l^^m zeroes or more, we get 

Lm{x) Lm {x) ^ £m,k ^^(1 Pm) “t” \/ ^Pm(l Pm)^ 



- I -f{Xm)£m,k + f'{Xm) + ^m{-£m,k)£m^k 



X {npm - \Jnpm{^ -Pm)'^ 



off/ \^m,k I _ / -^ Jr'rn 

■ J \Xm) X ‘£^Pm “t” £m,k \ ^ 

- 2 \ Pm 

£.2 

+ /" (Xm) -Pm) 

^m( ^m,k')^rn,k ^ \/ ‘^Pm (1 Pm ) 



1 Pm

In order to estimate the probability of getting at least $n(1-p) + \sqrt{np(1-p)}$ zeroes among
the outcomes $\xi_1^{(p)}, \xi_2^{(p)}, \ldots, \xi_n^{(p)}$ of Bernoulli trials with the probability of success
equal to p, we will rely on the following proposition, which is a special case of the
Berry-Esseen inequality. A more general statement of the Berry-Esseen inequality
and a proof may be found in [12], Sect. V.2.

Proposition 1. Let $p\in(0,1)$. If $S_n = \xi_1^{(p)} + \xi_2^{(p)} + \cdots + \xi_n^{(p)}$ and

$$F_n^{(p)}(x) = \Pr\left\{\frac{S_n - np}{\sqrt{np(1-p)}} \le x\right\} \qquad (17)$$

then

$$\sup_{-\infty<x<+\infty}\left|F_n^{(p)}(x) - \Phi(x)\right| \le \frac{p^2 + (1-p)^2}{\sqrt{np(1-p)}} \qquad (18)$$

for all $n = 1,2,\ldots$.

It is easy to see that

$$\Pr\left\{S_n \le np - \sqrt{np(1-p)}\right\} = \Pr\left\{\frac{S_n - np}{\sqrt{np(1-p)}} \le -1\right\} = F_n^{(p)}(-1) \qquad (19)$$

and $p^2 + (1-p)^2 \le 1$ for all $p\in(0,1)$. Thus the probability to get $n(1-p) + \sqrt{np(1-p)}$ or more
zeroes among $\xi_1^{(p)},\ldots,\xi_n^{(p)}$ is at least $\Phi(-1) - 1/\sqrt{np(1-p)}$.
The case of positive increments can be considered in a similar way.

Approaching the contradiction. The following lemma is required (see
[13], Lemma 3):

Lemma 1. Let $\mathfrak G$ be a game with the set of superpredictions S and $p\in(0,1)$. If
for every $(u,v)\in S$ the inequality $(1-p)u + pv \ge m$ holds, then for each superloss
process L w.r.t. $\mathfrak G$ we get $\mathbf E\,L(\xi_1^{(p)}\ldots\xi_n^{(p)}) \ge mn$, where $\xi_1^{(p)},\ldots,\xi_n^{(p)}$ are
results of independent Bernoulli trials with the probability of 1 being equal to p.

It follows from the lemma that

$$\mathbf E\,\mathcal K(\xi_1^{(p_m)}\ldots\xi_n^{(p_m)}) \ge \mathbf E\,L_m(\xi_1^{(p_m)}\ldots\xi_n^{(p_m)}) \qquad (20)$$

$$= ((1-p_m)x_m + p_m f(x_m))\,n , \qquad (21)$$

where $\xi_1^{(p_m)},\ldots,\xi_n^{(p_m)}$ are results of n independent Bernoulli trials with
the probability of success equal to $p_m$. We are going to obtain a contradiction
by beating this estimate through the use of (14). In order to do this, we will
construct two strategies, $\mathfrak A_0$ and $\mathfrak A'_0$, that work as follows. There is a strictly
increasing sequence of positive integers $n_1 < n_2 < \ldots$ and for each x of length
n such that $n_{i-1} < n \le n_i$ the strategy $\mathfrak A_0$ simulates $\mathfrak A_{m_i}$, while $\mathfrak A'_0$ simulates
$\mathfrak A_{m_i}^{\varepsilon_{m_i,k_i}}$ for some $m_i$ and $k_i$. Figure 4 presents a diagram of how the strategies
work. We denote the loss of $\mathfrak A_0$ by $L_0$ and the loss of $\mathfrak A'_0$ by $L'_0$.







Fig. 4. The operation of the strategies $\mathfrak A_0$ and $\mathfrak A'_0$.

Fig. 5. The estimate on expectations.



There is a number of possibilities for the location of the limit point $x_0 \in [a, b]$.
One of the following statements is true: (1) $x_0 = a$, (2) $x_0 \in (a, b)$, or (3) $x_0 = b$
(note that b may be equal to $+\infty$).

The first case can be reduced to the others by mirroring S w.r.t. the straight
line x = y (recall the remark about mirroring at the end of Sect. 5). In the
second case $f''(x_n) \to 0$ since $f'(x)$ is separated from 0 in a vicinity of $x_0$. In
the third case there are two possibilities, either $f'(x_n)$ is separated from zero
and, respectively, $f''(x_n) \to 0$, or $f'(x_n) \to 0$ and thus $f''(x_n)/f'(x_n) \to 0$ as
$n \to +\infty$. The first possibility is not essentially different from the second case
above, while the other one requires a separate treatment.

We obtain two essentially different situations to consider:

1. $x_m \to x_0 \in (a, b]$, there is c > 0 such that $f'(x_m) < -c < 0$ for all m =
1, 2, ..., and $f''(x_m) \to 0$ as $m \to +\infty$

2. $x_m \to b$, $f'(x_m) \to 0$, and $f''(x_m)/f'(x_m) \to 0$ as $m \to +\infty$.

In the first case, we can assume that all predictions made by $\mathfrak{A}_m$ and $\mathfrak{A}_m^k$,
where m and k are positive integers, belong to some finite square $[0, M]^2 \subseteq \mathbb{R}^2$,
where $0 < M < +\infty$. On the other hand, it follows from the definition (12) that
$p_m$ converges to $p_0 \in (0, 1)$ as $m \to +\infty$.

In the second case we cannot guarantee that all predictions are taken from
some bounded area. It is possible that $x_0 = +\infty$. We will rely on monotonicity
of $x_m$. As far as the probabilities are concerned, (12) implies that $p_m \to 1$ as
$m \to +\infty$.

The case $f'(x_m) \to 0$. We are going to rely on the convergence

$$\frac{f''(x_m)\, p_m}{1 - p_m} = -\frac{f''(x_m)}{f'(x_m)} \to 0$$

as $m \to +\infty$.

Let $\theta = x_1 + |x_1 - f(x_1)|$. Suppose that $\mathfrak{A}_0$ and $\mathfrak{A}'_0$ have already been constructed
for all x of length less than or equal to $n_{i-1}$. Consider the inequalities:

(23) 



1 < f'{Xm)£m,kn < 2 , 
$n_{i-1} < n$ ,


(24) 


1 ^ <?(-l) 


(25) 


\/npm{^-Pm)~ 2 




(26) 


^m,k — 1 ■> 


(27) 


|ci;m(^m,fc)| f i^m) ■ 


(28) 



They are consistent, i.e., there are positive integers n, m, and k that satisfy these
inequalities. For example, first let m be sufficiently large to satisfy (25) and
(26) and then tune k and n.

Now take $n_i$, $m_i$, and $k_i$ equal to some n, m, and k satisfying the inequalities
(recall that the way $\mathfrak{A}_0$ and $\mathfrak{A}'_0$ operate is illustrated by Fig. 4). It follows from
(14) that




holds with probability at least <?(— 1)/2. Let C B"* be the set of all strings 
X of length rij such that (29) holds on x, i.e., inequality (29) holds if and only if 

^(Pm,)^(Pm,) ^ i = 1,2, 

We use L'q to estimate the value IC{x) on a; G and Lq on x G B”* \ Hn.. 
There is a constant C such that 




Lo(®)Prp^.(®) + Y Lq{x)Vx p^,{x) + C (30) 

for all positive integers i, where Prp(a;) denotes the probability of the event 
Ap) Ap) Ap) _ ™ 

SI S2 • ■ • sra ~ ^ 

In order to obtain an estimate in terms of and LmA % consider the 
equality 

Y ^o(®)Prpm,(®) + Y p^.{x) = 

Y - L'o(x))FtpYxx) + Y (Lo{xx) - Lo{x))FTp^,{xx) + 

Y Lo{x)FrpYxx) + Y Mx)Prp^,{xx) , (31) 

where the sum is taken over all x of length rii-i and x of length rii — rii-i 
such that XX € while the sum is taken over all x of length Ui-i and x 
of length rii — rii-i such that xx G B"* \ . . The formula will hold if we replace

Lq by Lrrii and L'q by ’ * . Since by construction we have 

Zjq(xx) Ijq(x) = (xx) lijYi^(x) , (32) 

L'q(xx) - L'q(x) = (xx) - (x) (33) 

for all X of length rii-i and x of length rii — rii-i, we get 



^ L'o(x)Prp^.(x)+ Lo(x)Prp^,(x) \ - 



Y LtZ'’’‘*(x)Prp^,(x)+ Y 

/•Prrij /-Pmj ^Ptn^ 



Lrrii (^) Pr Prrii (®) 



< 



T7' ^ { T ( ^P'^i \ Tf ( ^P'^i ^P'^i t^P'^i \\ 

E max (Lq , ^2 ? • ■ • ? ) , Lq , ^2 5 ■ ■ ■ 5 ) ) 

I TH /r f ^P'^i t^P'^i f^P'f^i \ { ^P“^i /^P^i ^P'<^i \\ ^ 

H- E max ? S 2 ? ■ ■ ■ ? i Lrm ? S 2 5 ■ ■ ■ ^ s^i-i )) — 

ELo {er . ^2™^ , • ■ • , ?nr-\ ) + EL™, , • ■ • , ) + 2n,_i . (34) 



The last inequality follows from (27) and the following observation: for all $x \in \mathbb{B}^*$
and positive integers m and k, we have $|L_m(x) - L_m^k(x)| \le \varepsilon_{m,k}|x|$.

We will now prove the inequalities ELq )C«r-i) < ^ni-i and 

EL™, Indeed, while the length is less than rii-i, 

the strategies $\mathfrak{A}_0$ and $\mathfrak{A}_{m_i}$ output predictions $(x, f(x))$, where $x \in [x_1, x_{m_i}]$. Since
S is convex, all such points $(x, f(x))$ lie above the tangent line $(1 - p_{m_i})x + p_{m_i}y =
(1 - p_{m_i})x_{m_i} + p_{m_i}f(x_{m_i})$ but below the parallel line $(1 - p_{m_i})x + p_{m_i}y =
(1 - p_{m_i})x_1 + p_{m_i}f(x_1)$ (see Fig. 5). Hence $(1 - p_{m_i})x + p_{m_i}f(x) \le (1 - p_{m_i})x_1 +
p_{m_i}f(x_1) \le \theta$. This provides us with the desired inequalities.

Taking into account (26), we get 



E/C 



x£En, 



+ E ^>n,(a;)Prp^^(x)+^E^,/ E +C . 



Comparison with (20) and (29) yields 

/ 



4 V / i^Tn)Pn 



(35) 



1 Prrii „ / Pmi(^ Prrii) ^( ^ 

- 3 p™, + 1 / — :: — < C . 



\ fi^rrH)Pn 



(36) 



The left-hand side tends to infinity as i tends to $+\infty$ and this is a contradiction.

The case $f'(x_m) \not\to 0$. The proof in this case can be carried out in essentially
the same way. It can even be significantly simplified since all predictions made
by $\mathfrak{A}_m$ and $\mathfrak{A}_m^k$ can be taken from a bounded set.
Abstract. In the Full Information Game the player sequentially selects
one out of K actions. After the player has made his choice, the K payoffs 
of the actions become known and the player receives the payoff of the 
action he selected. The Gain-Loss game is the variant of this game, where 
both gains from [0, 1] and losses from [0, 1] are possible payoffs. This 
game has two well studied special cases: the Full Loss game where only 
losses are allowed, and the Full Gain game where only gains are allowed. 
For each of these cases the appropriate variant of Freund and Schapire’s 
algorithm Hedge [7,3] can be used to obtain nearly optimal regrets. Both 
of these variants have immediate adaptations to the Full Gain-Loss
game. However, these solutions are not always optimal.

The first result of this paper is a new variant of algorithm Hedge that 
achieves a regret of $O(\sqrt{\ln K}\sqrt{G_j + L_j})$ for the Full Gain-Loss game,
where j is the index of one of the actions in the game, Gj , the total gain 
of j, is the sum of all the positive payoffs that the jth action had in the 
game, and Lj is the absolute value of the sum of all its negative payoffs. 
In addition, the new algorithm matches the performance of the
known Hedge algorithms in the special cases of gains only and losses 
only. 

The second result is an application of the new algorithm that achieves 
new upper bounds on the regrets of the original Full Gain game and Full 
Loss game. The new upper bounds are a function of a new parameter. 
The third result is a method for combining online learning algorithms 
online. This method yields an $O(\min(\sqrt{L_{OPT} \ln K}, \sqrt{(T - L_{OPT}) \ln K}))$
upper bound on the regret of the Full Loss game, and an
$O(\min(\sqrt{G_{OPT} \ln K}, \sqrt{(T - G_{OPT}) \ln K}))$ upper bound on the regret
of the Full Gain game.



1 Introduction 

The Full Information Game was defined by Freund and Schapire in [8]. On each 
day (or other time step), an adversary assigns K payoffs to K given actions. 
Then, without knowing which payoffs the adversary has chosen, the player has to 
select one of the K actions. After the player has made his (possibly randomized) 
choice, the payoffs of all the actions become known, and the player receives the 
payoff of the action he had chosen. After T days, the expected total payoff of 
the player is the sum of his expected payoffs over all the days of the game. 
We compare the performance of the player (the online algorithm) with the
performance of an adversary who has the advantage of being able to make his 
choice after seeing all the future T payoff vectors. However, our adversary is 
limited to choosing only one action, and then this same action becomes his choice 
for the entire game. The goal of the player is, therefore, to minimize the regret 
which is the difference between his total expected payoff and the total payoff of 
the best action in the game (over all the days in the game). The adversary’s goal 
is to maximize the regret. 

We consider three variants of this game. The Full Loss Game is the variant 
of the Full Information game in which the adversary is limited to assigning losses 
from [0, 1] only, i.e., payoffs from [—1, 0] only. The Full Gain Game is the variant 
of the full information game where the adversary is limited to assigning only zero 
or positive payoffs from [0,1] (i.e., gains). For the first result of this paper we 
consider the Full Gain-Loss Game, which is a variant of the Full Information 
game where any payoff from [—1,1] is possible. In addition to being theoreti- 
cally interesting, allowing both gains and losses is necessary for modelling many 
practical scenarios including financial investments and the stock market. 

Algorithm Hedge and its proof technique by Freund and Schapire in [8]
present a generalized version of the online prediction methods. These methods
were developed in steps, starting from Littlestone and Warmuth’s [11] Weighted 
Majority algorithm and Vovk’s [12] Aggregating Strategies. The Weighted 
Majority algorithm left a parameter to be tuned. A similar but more restricted 
game than the Full Information game was considered by Cesa-Bianchi, et al. in 
[5]. They took a major step and showed that the Weighted Majority technique’s 
parameter can be tuned yielding an upper bound on the regret that is of the 
order of the square root of the total payoff of the best action. 



1.1 Upper Bound for the Full Gain-Loss Game 

The first result of this paper focuses on finding and investigating the update 
rules that are best for the case of the Full Information Game where both gains 
and losses are allowed. 

Freund and Schapire’s [8] algorithm Hedge(β) has an upper bound of
$O(\sqrt{L_{\min} \ln K})$ on the regret of the Full Loss game, where $L_{\min}$ is the total loss
of the action that has a minimal total loss (maximal total payoff) in the game.
Algorithm Hedge(α) that appeared in Auer et al. [3] is the variant of Hedge for
the Full Gain game. By a trivial transformation of the range of the losses or the
gains, both Hedge(β) and Hedge(α) can be applied to the Full Gain-Loss game,
yielding analogous upper bounds on the regret. Let us consider using Hedge(α)
via the transformation where any $x_i(t)$, a payoff of action i in iteration t, is
substituted by $\frac{1}{2}(1 + x_i(t))$. The Hedge technique’s upper bound on the regret

of the Gain-Loss game is 0( min^g^i 2 ,..../c} Vln K\Jt Xi (t) ), where T 

is the number of days in the game. Using Hedge(/3) with the analogous trans- 
formation yields an upper bound of 0( minjg{i 2 ,...,*:} Vln AtVt — ^i(f) ) 
on the regret of the Gain-Loss game. This result means an $O(\sqrt{\ln K}\sqrt{T})$ upper
bound in the case where we used Hedge(a) but minjg{i 

positive, or if we used Hedge(/3) but minjg^i 2 ,.,,,*:} turned out to be 

negative, or in any case where min^g^i 2 ,.,,,/c} small. 

In this work we present algorithm GL. This algorithm is a variant of Hedge
that has a new generalized update rule.

If we are given in advance $\hat{G}_{OPT} + \hat{L}_{OPT}$, an upper bound on
$G_{OPT} + L_{OPT}$, where OPT is an action that achieves the maximal total payoff,
then GL(α), the parameterized version of GL, can be tuned to achieve a regret
of $O(\sqrt{\ln K}\sqrt{\hat{G}_{OPT} + \hat{L}_{OPT}})$ (Corollary 2, Section 3.3). Let $G_j$ be the sum of all
the positive intakes (gains) of the j’th action in the game, and let $L_j$ be the
sum of all the negative intakes (losses) of the j’th action in the game. The more
general version of the last result states that for any action $a_j$, if an upper bound
$\hat{G}_j + \hat{L}_j \ge G_j + L_j$ is given in advance, then GL(α) can be tuned to achieve a
regret of $O(\sqrt{\ln K}\sqrt{\hat{G}_j + \hat{L}_j})$ (Corollary 1, Section 3.3).

For the case where no prior knowledge is assumed, GL.1, the variant of the
new algorithm that uses the known doubling technique (used in [3], [4] and [5]),
guarantees that there exists an action $a_j$ such that the regret is
$O(\sqrt{\ln K}\sqrt{G_j + L_j})$. This upper bound matches the lower bound mentioned
later, up to the choice of actions in the regret expression.

The parameter T of the known results can be substantially larger than the
parameter $G_j + L_j$ that the results of this paper depend on. For each day, 1 is the
theoretical upper bound on the gain or loss of an action per day, and therefore
may be far from the typical actual gain or loss value. For $G_j + L_j$ to be
equal to T, this maximal gain or loss of 1 per day has to be the payoff of action
$a_j$ every day. The constants of the new result are slightly better than those of
the known results, as the above transformations add a $\sqrt{2}$ multiplicative factor
to the upper bound.

The lower bound of $\Omega(\sqrt{T \ln K})$ in Cesa-Bianchi et al. [5] implies an
$\Omega(\sqrt{L_{\min} \ln K})$ lower bound on the Full Loss game, where $L_{\min}$ is the total loss of
the action that achieves the minimal total loss. It also implies an $\Omega(\sqrt{G_{\max} \ln K})$
lower bound on the Full Gain game, where $G_{\max}$ is the total gain of the action
that achieves the maximal total gain. These lower bounds can be combined to
show that the regret of the Full Gain-Loss game has an $\Omega(\sqrt{\ln K}\sqrt{G_{\max} + L_{\min}})$
lower bound. The significance of this lower bound is that it shows that the regret
of the Gain-Loss game cannot be a function of the optimal (maximal) net payoff
of the best action in the game (i.e., cannot be $G_{OPT} - L_{OPT}$). In fact, the proportion
between the regret and the optimal net payoff is unbounded. For example,
the optimal net payoff may be zero, while the minimal $G_j + L_j$ for any action
$a_j$ can be as large as T.

In the Full Gain-Loss game, the a priori range of payoffs is [-1,1]. Both the 
Full Gain game and the Full Loss game are special cases of the Full Gain-Loss 
game. In both cases if it is known in advance that a special case is about to 
take place, the appropriate Hedge algorithms can be applied yielding nearly 
optimal regrets for these special cases. In particular, if we know in advance that
all the payoffs will be gains, Hedge(α) can be applied yielding the upper bound
of $O(\sqrt{G_{\max} \ln K})$ on the regret. If we know in advance that all the payoffs will
be losses, Hedge(β) can be used to achieve an upper bound of $O(\sqrt{L_{\min} \ln K})$ on
the regret.

The new algorithm GL does not require prior knowledge of whether the 
coming game would be a special case or not, and yet if a special case takes 
place, GL achieves the known regret of the special case. That is, in the special 
case where all the payoffs turn out to be from [0, 1], GL.1, the doubling version of
GL, achieves an upper bound of $O(\sqrt{G_{\max} \ln K})$ on the regret, and in the special
case where all the payoffs turn out to be from [-1, 0], GL.1 achieves an upper
bound of $O(\sqrt{L_{\min} \ln K})$ on the regret. The constants in these cases match those
of Hedge. If it turns out that the game is not a special case, the same algorithm
GL.1 still achieves the new and improved general-case result and guarantees that
there exists an action $a_j$ such that the $O(\sqrt{\ln K}\sqrt{G_j + L_j})$ upper bound on the
regret holds.



1.2 New Upper Bounds for the Full Information Game 

The second result of this paper is an application of algorithm GL that achieves 
new upper bounds on the regrets of the Full Gain game and Full Loss game. 
While the new algorithms and results are based on the GL technique which in 
turn is based on the Hedge technique of Freund and Schapire [8] , they require 
the addition of a new technique. The new upper bounds are a function of a 
new parameter. Only in its worst case does this parameter become equal to the 
parameters of the original upper bounds - the optimal total gain and optimal 
total loss. Given an online algorithm A, for any arm $a_i$, we define $D_i^A$, the
exceeding value of $a_i$ with respect to A, as $\sum_{t \text{ s.t. } g_i(t) > g_A(t)} (g_i(t) - g_A(t))$, where
$g_A(t)$ is the expected gain of the online algorithm A in iteration t, and $g_i(t)$ is
the gain of action $a_i$ in iteration t. For the Full Gain game, algorithm GGL (from
section 4) achieves a regret of $O(\sqrt{\ln K}\sqrt{D_j^{GGL}})$, where $a_j$ is one of the actions
$j \in \{1, 2, \ldots, K\}$ (Theorem 2 and Theorem 3 from section 4). According to this
new update rule, we judge an action not by its absolute accumulative payoff, but 
by the offsets it has from the payoff of the mixture of actions that we expected 
to be best, that is, by the "surprises" this action gave us from what we believed
to be a good choice. It is interesting to see that such a rule is enough to ensure 
the ability to compete with the optimum. 

We can show that in the Full Gain game for all actions $j \in \{1, 2, \ldots, K\}$,
$D_j^{GGL} \le G_{OPT}$ and therefore depending on this new parameter could be better than
depending on the original $G_{OPT}$ parameter. However, our constants are slightly
weaker than those of the known Hedge upper bounds on the regrets from [8]:
we have an additional $\sqrt{2}$ multiplicative factor on our upper bounds that the
known results do not have.

Theoretically, the interesting point about these new results is that they show 
the first relation between the ability to approach the performance of the best ac- 
tion and how different the good actions are from the performance of the mixture
of actions that we expected to be best. Intuitively, it seems that the new results 
are more explicit in showing that the less adversarial the input (of payoffs) is, 
the better the new algorithm can perform. Our hope is that this can lead to 
upper bounds that guarantee a good regret in the worst case, but in addition, 
know how to capitalize on the good fortune of better behaved sources of payoffs, 
and have an improved performance in those cases. Our preliminary experiments 
support this hope. 

One can conduct similar discussions about the Full Loss game and the Full
Gain-Loss game, and show that they have the same type of $O(\sqrt{\ln K}\sqrt{D_j})$
upper bound on the regret.

1.3 Combining Online Learning Algorithms Online 

In [8] Freund and Schapire generalized the Weighted Majority technique of N.
Littlestone and M. K. Warmuth from [11], and presented algorithm Hedge(β)
for the Full Loss game. Let $H_I$ denote the doubling variant of algorithm Hedge(β),
i.e., the variant of Hedge(β) that tunes β using the standard doubling technique
that was used in [3], [4] and [5].

Algorithm Hedge(α) of Auer, Cesa-Bianchi, Freund and Schapire [2] is
designed for the Full Gain game (discussed in section 4); however, via an immediate
transformation from losses to gains it can be applied to the Full Loss
game. In this transformation we "translate" any loss $l \in [0, 1]$ from the real input
of the Full Loss game to a "simulated gain" $g = 1 - l$. Hedge(α) can then run
on these simulated gains. Let $H_{II}$ be the variant of Hedge(α) that uses this
transformation and tunes α using the standard doubling technique.

By Freund and Schapire’s generalized Weighted Majority technique of [8],
$H_I$ achieves an upper bound of $O(\sqrt{L_{OPT} \ln K})$ on the regret of the Full Loss
game. An upper bound of $O(\sqrt{(T - L_{OPT}) \ln K})$ on the regret of $H_{II}$ follows
from the analysis of Hedge(α) that appears in Auer, Cesa-Bianchi, Freund and
Schapire’s [2].

In this work we construct an algorithm that guarantees a regret that is upper
bounded simultaneously by both of the above upper bounds, that is, a regret that
is upper bounded by $O(\min(\sqrt{L_{OPT}}, \sqrt{T - L_{OPT}}) \cdot \sqrt{\ln K})$. The motivation for
such a result is the basic competitive approach of the online and online learning
fields: once certain performances are possible, we want an online algorithm with
a performance that is not much worse than any of them.

When considering the goal of an $O(\min(\sqrt{L_{OPT}}, \sqrt{T - L_{OPT}}) \cdot \sqrt{\ln K})$ upper
bound on the regret of the Full Loss game, the first, yet wrong, impression
is that this goal can be met by the following immediate construction. Let us
consider combining algorithms $H_I$ and $H_{II}$ in the obvious way, using a master
Hedge algorithm that has a set of two actions, which are algorithms $H_I$ and $H_{II}$.
This master algorithm would achieve, up to the regret, the performance of the
one out of the two that turns out to be the best in hindsight. This bootstrapping
idea does not work, since we have to choose what type of master Hedge algorithm
to use: the options are a Hedge algorithm of the $H_I$ type or one of the $H_{II}$
type. This a priori choice dictates the upper bound on the regret of the master
algorithm. If we choose the first option of an $H_I$ type of master algorithm, we
get an upper bound that depends on $L_{OPT}$. Otherwise (if we choose the second
option) we obtain an upper bound that depends on $T - L_{OPT}$.

In this work a new technique for combining online learning algorithms is
presented. With this technique algorithms $H_I$ and $H_{II}$ are combined into a new
algorithm. This new algorithm is called algorithm Zapping, and it achieves an
upper bound of $O(\min(\sqrt{L_{OPT}}, \sqrt{T - L_{OPT}}) \cdot \sqrt{\ln K})$ on the regret of the Full
Loss game (Theorem 4, section 5). The new technique also yields an analogous
result for the Full Gain game.

2 Notation and Terminology 

The Full Information Game was defined by Freund and Schapire [8]. All three 
versions of the Full Information game are carried out in T iterations. Each iter- 
ation is done as follows: 

1. The adversary chooses $x(t) \in [-1, 1]^K$, a vector of K payoffs, for the current
iteration t. The i’th component $x_i(t)$ is the payoff that is associated with
action i. This is the payoff that we will receive if we choose the i’th action
in this iteration.

2. Without knowing the adversary’s choice, we are asked to choose one action
from $\{a_1, a_2, \ldots, a_K\}$.

3. The payoff vector $x(t)$ becomes known.

When making our choice in step 2, we, the online algorithm, are allowed to use 
randomization. We can also use the information we have accumulated in former 
iterations. 

There are three variants of the Full Information game. For any choice of
payoffs from [-1, 1] that is made by the adversary, if $x_i(t)$, the payoff of action
$a_i$ at time t, is positive it is called a gain, and it has an additional notation $g_i(t)$.
If $x_i(t)$ is negative we denote the absolute value of it by $l_i(t)$, and call it a loss.
$X_i$ denotes the total payoff of action i throughout the game.

$G_i$, the total gain of action i throughout the game, is the summation of $g_i(t)$
over all the iterations where $x_i(t)$ is positive.

$L_i$, the total loss of action $a_i$, is the summation of $l_i(t)$ over all the iterations
where $x_i(t)$ is negative. Trivially, $X_i = G_i - L_i$, and may be positive, negative,
or zero.

No stochastic assumptions are made as to how the adversary generates the
payoffs. We compare our performance to $X_{OPT}$, the optimal payoff, which is the
performance of the best action OPT. $X_{OPT}$ is defined as $X_i$ maximized over all the
actions $i \in \{1, 2, \ldots, K\}$. In other words, we compete against $X_{OPT}$, the maximal
total payoff of any consistent choice of action in this game. Our performance
measurement, the regret of an online algorithm A, is the optimal payoff minus
the expected total payoff of A, maximized (sup) over all possible assignments of
payoffs in the game. The regret of the game is the regret of an online algorithm
A, minimized (inf) over all online algorithms A.
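The quantities of this section can be computed directly for a fixed payoff sequence. The sketch below is illustrative only: it assumes payoffs are stored as a list of per-iteration vectors, takes the regret on that one sequence to be the optimal payoff minus the expected total payoff of the algorithm (so that falling short of the best action gives a positive value), and uses function names that do not appear in the paper.

```python
from typing import List, Sequence, Tuple


def action_totals(payoffs: Sequence[Sequence[float]], i: int) -> Tuple[float, float, float]:
    """Return (G_i, L_i, X_i) for action i, where payoffs[t][i] is x_i(t) in [-1, 1]."""
    gain = sum(x[i] for x in payoffs if x[i] > 0)    # sum of positive payoffs
    loss = sum(-x[i] for x in payoffs if x[i] < 0)   # sum of |negative payoffs|
    return gain, loss, gain - loss


def regret_on_sequence(payoffs: Sequence[Sequence[float]],
                       distributions: Sequence[Sequence[float]]) -> float:
    """X_OPT minus the expected total payoff of the played distributions P(t)."""
    k = len(payoffs[0])
    x_opt = max(action_totals(payoffs, i)[2] for i in range(k))
    expected = sum(
        sum(p_i * x_i for p_i, x_i in zip(p, x))
        for p, x in zip(distributions, payoffs)
    )
    return x_opt - expected


# Two actions, three iterations; the player mixes uniformly every day.
payoffs: List[List[float]] = [[1.0, -0.5], [0.5, -1.0], [-0.5, 0.5]]
uniform = [[0.5, 0.5]] * 3
print(action_totals(payoffs, 0), regret_on_sequence(payoffs, uniform))
```

The regret of an algorithm in the sense of the text is the supremum of this per-sequence quantity over all payoff assignments, which the sketch does not attempt to compute.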



3 Full Gain-Loss Game with Prior Knowledge 

3.1 The New Update Rule 

Let us consider Hedge(β) for the Full Loss game. In algorithm Hedge(β), initially
equal weights are given to the K actions. At each iteration Hedge chooses an
action according to a probability distribution that is obtained by normalizing
the weight vector. At the end of each iteration, after $l(t)$, the loss vector of the
actions, becomes known, Hedge(β) updates the weight of each action by multiplying
it with a multiplicative factor. The most discussed multiplicative factor
for Hedge(β) is, for arm $a_i$: $\beta^{l_i(t)}$.

When looking for a variant of Hedge for the Gain-Loss game, the immediate
choice of a multiplicative update factor for the weights is $\beta^{l}$ for the case where
the payoff is a loss l, and $\beta^{-g}$ for the case where the payoff is a gain g. This
choice satisfies the naive intuition that a gain of 1 plus a loss of 1 should have
the same effect as a net profit of 0 that comes from no loss and no gain. In other
words this intuitive update rule has the quality that all actions with identical
net profit are given the same weight. However, it turns out that in the case of the
Full Gain game, the Hedge proof technique cannot be successfully applied to this
choice of update rule. Moreover, the above naive intuition had to be abandoned
before the new update rule could be found.

In this paper a new generalized update rule is suggested for the Gain-Loss
game. In this rule the multiplicative factor for the weights is $(1 + \alpha)^g$ when
the payoff is a gain g, and $(1 - \alpha)^l$ when the payoff is a loss l. In the general
case, the new rule does not necessarily give equal weights to actions that have
identical cumulative payoffs. Rather, between two actions that have an identical
net profit, there would be a higher weight for the action $a_i$ that has the smaller
$G_i(t) + L_i(t)$, which is intuitively the one with the "smaller circulation", or the
one that is "more stable". This quality corresponds to the conclusion of the worst
case risk analysis that says that if two investments have the same net profit the
one with the smaller $G_i(t) + L_i(t)$ is less risky, and is therefore preferable. We
call the algorithm that uses the new update rule GL. For this update rule a
variant of the Hedge proof technique by Freund and Schapire [8] can be applied,
yielding the new upper bounds of algorithm GL.
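A small numeric sketch may make the difference between the two rules concrete. It assumes that the naive rule multiplies a weight by $\beta^{l}$ for a loss l and by $\beta^{-g}$ for a gain g (a reading consistent with the cancellation property described above), and that weights start at 1; the function names are hypothetical.

```python
def naive_weight(total_gain: float, total_loss: float, beta: float) -> float:
    """Cumulative weight under the assumed naive rule: beta**l per loss, beta**(-g) per gain."""
    return beta ** (total_loss - total_gain)


def gl_weight(total_gain: float, total_loss: float, alpha: float) -> float:
    """Cumulative weight under the GL rule: (1+alpha)**g per gain, (1-alpha)**l per loss."""
    return (1.0 + alpha) ** total_gain * (1.0 - alpha) ** total_loss


# Three actions with the same net profit 0 but different "circulation" G + L.
for gain, loss in [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]:
    print(gain + loss, naive_weight(gain, loss, 0.5), gl_weight(gain, loss, 0.1))
```

Under the naive rule all three actions keep the same weight, while under the GL rule the weight is $(1 - \alpha^2)^G$ when G = L, so it shrinks as the circulation grows.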



3.2 Algorithm GL(a) and Its Analysis 

Algorithm GL(α), the parameterized version of GL, is described in Figure 1. If
the parameter α is tuned using a given in advance upper bound $\hat{G}_{OPT} + \hat{L}_{OPT}$ on
$G_{OPT} + L_{OPT}$, GL(α) achieves an upper bound of $O(\sqrt{(\hat{G}_{OPT} + \hat{L}_{OPT}) \ln K})$ on
the regret.
Algorithm GL(α)

Parameters: A real number 0 < α < 1.

Initialization:

For i = 1, 2, ..., K set the initial weight of action $a_i$, $w_i^1$ :=

Repeat for t = 1, 2, ... until game ends:

1. Choose action $i_t$ according to distribution P(t),
where $P_i(t) = w_i^t / \sum_{j=1}^{K} w_j^t$, and where $w_i^t$ is the
weight of action i at the beginning of iteration t.

2. Observe the reward vector x(t) and receive payoff $x_{i_t}(t)$.

3. Update the weights:

For any action that received a positive or zero payoff
(a gain $g_i(t)$) in the current iteration:

$w_i^{t+1} = w_i^t \cdot (1 + \alpha)^{g_i(t)}$

For any action that received a negative payoff
(a loss $l_i(t)$) in the current iteration:

$w_i^{t+1} = w_i^t \cdot (1 - \alpha)^{l_i(t)}$

Fig. 1. Parameterized version of algorithm GL

Algorithm GL, like algorithm Hedge, is actually a family of algorithms, as
the proof technique allows some freedom in the choice of the update rule. In GL
the multiplicative update for a loss l can be any function $U_\alpha(l)$ s.t.
$(1 - \alpha)^l \le U_\alpha(l) \le 1 - \alpha l$, and the multiplicative update for a gain g can be
any function $U_\alpha(g)$ s.t. $(1 + \alpha)^g \le U_\alpha(g) \le 1 + \alpha g$. The linear rule $1 + \alpha x$ is
also a member of the Hedge family of update rules. However, for the rest of the
Hedge family of update rules, we cannot show the performance of the GL family
of update rules.

Theorem 1. For all 0 < α < 1, any action $a_j$ s.t. $j \in \{1, 2, \ldots, K\}$ and any
choice of payoffs from [-1, 1] that is made by the adversary, the expected total
payoff of algorithm GL(α) is at least

$$X_{GL(\alpha)} \ge \frac{G_j \cdot \ln(1 + \alpha) + L_j \cdot \ln(1 - \alpha) - \ln K}{\alpha} .$$

The proof of Theorem 1 is a variant of Freund and Schapire’s proof of the
performance of algorithm Hedge from [8], that includes an adaptation to our
case.
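The guarantee of Theorem 1 can be checked on a concrete payoff sequence. The following is a minimal sketch, not the authors' implementation: it assumes equal initial weights of 1 (any equal positive value gives the same distributions P(t)), reads the bound as $(G_j \ln(1+\alpha) + L_j \ln(1-\alpha) - \ln K)/\alpha$, and computes the expected payoff exactly as the sum of $P(t) \cdot x(t)$ instead of sampling actions. All function names are illustrative.

```python
import math
import random


def gl_update(weight: float, payoff: float, alpha: float) -> float:
    """Multiplicative GL(alpha) update for a single payoff in [-1, 1]."""
    if payoff >= 0:
        return weight * (1.0 + alpha) ** payoff   # gain g = payoff
    return weight * (1.0 - alpha) ** (-payoff)    # loss l = |payoff|


def gl_expected_payoff(payoffs, alpha: float) -> float:
    """Expected total payoff of GL(alpha) on a fixed sequence of payoff vectors."""
    k = len(payoffs[0])
    weights = [1.0] * k  # assumption: equal initial weights
    total = 0.0
    for x in payoffs:
        norm = sum(weights)
        total += sum(w / norm * xi for w, xi in zip(weights, x))  # P(t) . x(t)
        weights = [gl_update(w, xi, alpha) for w, xi in zip(weights, x)]
    return total


def theorem1_bound(payoffs, j: int, alpha: float) -> float:
    """(G_j ln(1+alpha) + L_j ln(1-alpha) - ln K) / alpha for action j."""
    gain = sum(x[j] for x in payoffs if x[j] > 0)
    loss = sum(-x[j] for x in payoffs if x[j] < 0)
    k = len(payoffs[0])
    return (gain * math.log(1.0 + alpha) + loss * math.log(1.0 - alpha) - math.log(k)) / alpha


rng = random.Random(0)
payoffs = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(200)]
achieved = gl_expected_payoff(payoffs, 0.3)
bounds = [theorem1_bound(payoffs, j, 0.3) for j in range(4)]
print(achieved, max(bounds))
```

Random payoffs are only a convenience here; the inequality is stated for every payoff assignment, so any sequence with entries in [-1, 1] can be substituted.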
3.3 Tuning the Parameter Using Prior Knowledge

Corollary 1. For any choice of payoffs from [-1, 1] that is made by the adversary,
and any action $a_j$ s.t. $j \in \{1, 2, \ldots, K\}$, given an upper bound
$\hat{G}_j + \hat{L}_j \ge G_j + L_j$, if the parameter α is set to be
$\frac{\sqrt{2 \ln K}}{\sqrt{\hat{G}_j + \hat{L}_j} + \sqrt{2 \ln K}}$ then the
expected total payoff of algorithm GL(α) is at least

$$X_{GL(\alpha)} \ge G_j - L_j - \sqrt{2 \ln K} \cdot \sqrt{\hat{G}_j + \hat{L}_j} - \ln K .$$

In particular, this inequality is valid when action $a_j$ is OPT, the action that
achieves the maximal total payoff over all actions in the game.
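The tuning step can be illustrated numerically. The printed statement of the corollary is partly illegible, so the sketch below rests on an assumed reading: α is taken as $\sqrt{2 \ln K} / (\sqrt{\hat{G}_j + \hat{L}_j} + \sqrt{2 \ln K})$ and the guarantee as $G_j - L_j - \sqrt{2 \ln K}\sqrt{\hat{G}_j + \hat{L}_j} - \ln K$. Under that reading the guarantee follows from the bound of Theorem 1, which the code compares against; the function names are hypothetical.

```python
import math


def tuned_alpha(k: int, bound_gl: float) -> float:
    """Assumed tuning: sqrt(2 ln K) / (sqrt(bound_gl) + sqrt(2 ln K)), with bound_gl >= G_j + L_j."""
    root = math.sqrt(2.0 * math.log(k))
    return root / (math.sqrt(bound_gl) + root)


def theorem1_guarantee(gain: float, loss: float, k: int, alpha: float) -> float:
    """Lower bound of Theorem 1 for an action with total gain `gain` and total loss `loss`."""
    return (gain * math.log(1.0 + alpha) + loss * math.log(1.0 - alpha) - math.log(k)) / alpha


def corollary1_guarantee(gain: float, loss: float, k: int, bound_gl: float) -> float:
    """Assumed closed form: G_j - L_j - sqrt(2 ln K) * sqrt(bound_gl) - ln K."""
    return gain - loss - math.sqrt(2.0 * math.log(k)) * math.sqrt(bound_gl) - math.log(k)


cases = [(30.0, 10.0, 4, 40.0), (5.0, 0.0, 2, 8.0), (0.0, 50.0, 10, 50.0), (120.0, 80.0, 16, 250.0)]
for gain, loss, k, bound_gl in cases:
    alpha = tuned_alpha(k, bound_gl)
    print(alpha, theorem1_guarantee(gain, loss, k, alpha), corollary1_guarantee(gain, loss, k, bound_gl))
```

The tuned α always lies strictly between 0 and 1 for K ≥ 2, so it is an admissible parameter for GL(α).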



4 New Upper Bound for the Full Gain Game 

In the Full Gain game, in any iteration t the adversary chooses a vector of "input"
gains $g(t) \in [0, 1]^K$. In algorithm GGL(α) we apply a new transformation on
these "input" gains. Then we feed the transformed values to GL(α). For this
transformation we refer to the expected payoff of our online algorithm GGL(α)
in iteration t, which is $X_{on}(t) = P(t) \cdot g(t)$.

Theorem 2. In the Full Gain game, for any choice of payoffs from [0, 1] that
is made by the adversary, if algorithm GGL(α) is given in advance
an upper bound on the exceeding value of OPT, where
OPT is the action that achieves a maximal total gain over all the other actions 
in the game, then if the parameter a is set to be Vin j c ^ then algorithm 

GGL{a) achieves, 

^GGL(a) > Gopt ~ 2 Vh^ ' - In K . 

The proof of Theorem 2 is in Appendix A. It is based on the proof of Theorem
1 and Corollary 1 (the analysis of algorithm GL(α)), which are based on the
technique of Freund and Schapire’s proof in [8]. However, this analysis requires
the addition of a new technique.

Let us define algorithm GGL.l as the doubling variant of GGL that is con- 
structed using the known doubling technique (used in [3], [4] and [5]). 

Theorem 3. In the Full Gain game, for any choice of payoffs from [0, 1] that
is made by the adversary, there exists an action $a_j$ s.t. algorithm GGL.1, the
doubling variant of algorithm GGL, achieves $X_{GGL.1} \ge$

Gopt - 5.7V2hdL • - 1.31n Klog2(i:>f'^-^ i) - 7.31nK. 

Lemma 1. In the Full Gain game, for any choice of gains that the adversary
makes, for any action $a_j$ s.t. $j \in \{1, 2, \ldots, K\}$, $D_j^{GGL(\alpha)} \le G_{OPT}$.

Proof of Lemma 1

For any action $a_i$: By definition

$$D_i^{GGL(\alpha)} = \sum_{t \text{ s.t. } g_i(t) > X_{GGL(\alpha)}(t)} \left( g_i(t) - X_{GGL(\alpha)}(t) \right) .$$

Since in the Full Gain game $X_{GGL(\alpha)}(t)$ is always positive or zero,

$$D_i^{GGL(\alpha)} \le \sum_{t \text{ s.t. } g_i(t) > X_{GGL(\alpha)}(t)} g_i(t) \le \sum_{t=1}^{T} g_i(t) = G_i \le G_{OPT} .$$



Algorithm GGL(α):

Parameters: A real number 0 < α < 1.

Initialization: Initialize algorithm GL(α).

Repeat for t = 1, 2, ... until game ends:

Apply algorithm GL(α) to iteration t in the following way:

1. Let GL(α) produce a probability vector P(t),
and choose an action for iteration t according to P(t).

2. Calculate the expected payoff of iteration t, $X_{on}(t) = P(t) \cdot g(t)$.

3. For any action i, calculate $y_i(t) := g_i(t) - X_{on}(t)$.

/* Note that for all iterations t and actions i, $y_i(t) \in [-1, 1]$. */

4. Let GL(α) receive the vector y(t) as the vector of payoffs
for iteration t.

5. Let GL(α) update its weights according to y(t).

Fig. 2. Parameterized version of algorithm GGL
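The steps of Figure 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the inner GL(α) weights start at 1, the expected gain is accumulated exactly instead of sampling an action, and the exceeding value of each action is tracked as the sum of the positive offsets $g_i(t) - X_{on}(t)$. The function name and return shape are hypothetical.

```python
import random


def ggl_run(gains, alpha: float):
    """Run GGL(alpha) on gain vectors in [0, 1]^K.

    Returns (expected total gain, list of exceeding values D_i).
    """
    k = len(gains[0])
    weights = [1.0] * k          # assumption: equal initial weights for the inner GL(alpha)
    total = 0.0
    exceeding = [0.0] * k
    for g in gains:
        norm = sum(weights)
        probs = [w / norm for w in weights]               # step 1: P(t)
        x_on = sum(p * gi for p, gi in zip(probs, g))     # step 2: X_on(t) = P(t) . g(t)
        total += x_on
        y = [gi - x_on for gi in g]                       # step 3: y_i(t), always in [-1, 1]
        for i, yi in enumerate(y):
            if yi > 0:
                exceeding[i] += yi                        # positive offsets only
        # steps 4-5: feed y(t) to GL(alpha) and apply its multiplicative update
        weights = [
            w * ((1.0 + alpha) ** yi if yi >= 0 else (1.0 - alpha) ** (-yi))
            for w, yi in zip(weights, y)
        ]
    return total, exceeding


rng = random.Random(1)
gains = [[rng.random() for _ in range(3)] for _ in range(100)]
total_gain, exceeding_values = ggl_run(gains, 0.2)
g_opt = max(sum(g[i] for g in gains) for i in range(3))
print(total_gain, exceeding_values, g_opt)
```

On any gain sequence the tracked exceeding values stay below the optimal total gain, which is the content of Lemma 1.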



5 Combining Online Learning Algorithms 

Let Hi denote The doubling version of algorithm Hedge(/3) from Freund and 
Schapire’s [8]. Let Hu denote The doubling version of algorithm Hedge(a) from 
Auer et al.’s [3]. algorithm Zapping is described in figure 3. This algorithm can 
be viewed as a master algorithm that in each iteration let one out of the two 
algorithms, Hi or Hu, handle the current iteration. In any run of algorithm 
Zapping the two algorithms Hi and Hu split the sequence of iterations between 
them such that each algorithm handles only a (not necessarily continuous) sub- 
sequence of the iterations, and completely ignores the rest of the iterations. 

In each iteration t, one and only one of these two algorithms is the active 
algorithm, i.e., is the algorithm that is used for choosing the action, and has its 
weights and variables updated at the end of the iteration. The algorithm that did 
not become the active algorithm in iteration t is not considered when choosing 
the action in iteration t, and its weights and variables remain unchanged as if 
iteration t never took place. 

Let $S_I$ and $S_{II}$ denote the two subsequences that are handled by algorithms
$H_I$ and $H_{II}$ respectively. These subsequences of $S$ are not necessarily contiguous;
rather, each of them consists of a subset of the iterations of $S$ such that the
original order between the iterations is preserved. The subsets of iterations
in $S_I$ and in $S_{II}$ are disjoint, and together they cover all the
iterations of $S$.

The decision of how to split $S$ into the two subsequences $S_I$ and $S_{II}$ is made
online. A simple deterministic rule is used for deciding to which of the two
subsequences the current iteration belongs, that is, which algorithm should be
the active algorithm in the current iteration. This rule is not the greedy one:
the algorithm that had the best performance (or the best regret) up to this
iteration will not necessarily be the active algorithm in the current iteration. Rather,
the algorithm ($H_I$ or $H_{II}$) that had the smallest “Hedge technique upper bound
on the local regret” so far will be the active algorithm in the coming iteration.



5.1 Definitions 

The sequence of iterations 1 through $t$ from $S = 1, 2, \ldots, T$ is denoted $S(t)$.
$S_I(t)$ denotes the subsequence of $S(t)$ that contains only the iterations where
algorithm $H_I$ is active, and $S_{II}(t)$ is analogous.

Let us consider any given choice of losses that is made by the adversary. For
$S'$, a subsequence of iterations, $opt(S')$ denotes the action that is optimal when
only the iterations of $S'$ are considered, i.e., the action that achieves
$\min_i \sum_{t \in S'} l_i(t)$.

$L_{opt(S')}(S^*)$, the cumulative loss of action $opt(S')$ over the subsequence $S^*$, is

$$L_{opt(S')}(S^*) = \sum_{t \in S^*} l_{opt(S')}(t).$$

For a subsequence of the iterations $S'$, $L_{opt(S')}(S')$ is the optimal loss with
respect to $S'$.

Let $A_t$ denote the active algorithm of iteration $t$.

$l(A_t, t)$ is the loss of the action that $A_t$ chooses in iteration $t$, i.e., the loss that
is incurred in iteration $t$. Since $H_I$ and $H_{II}$ are randomized algorithms, $l(A_t, t)$
is a random variable.

The local loss of algorithm A over the subsequence of iterations S' is, La{S') = 

The local regret of algorithm $H_I$ at the beginning of iteration $t$ is the difference
between the local loss of algorithm $H_I$ over $S_I(t-1)$ and the optimal loss with
respect to $S_I(t-1)$.

Analogously, the local regret of algorithm $H_{II}$ at the beginning of iteration $t$
is the difference between the local loss of algorithm $H_{II}$ over $S_{II}(t-1)$ and the
optimal loss with respect to $S_{II}(t-1)$.



5.2 The Rule That Algorithm Zapping Uses 

We will use the following function for the description of the upper bounds on
the regret:

$$Q(X) \triangleq 5.7 \cdot \sqrt{\ln K \cdot X} + \ln K \cdot \log_2(X + 1) + 6.9 \ln K.$$

The following upper bounds follow from Freund and Schapire [8]. The local
regret of $H_I$ at the beginning of iteration $t$ is upper bounded by
$Q\bigl(L_{opt(S_I(t-1))}(S_I(t-1))\bigr)$. The local regret of $H_{II}$ at the beginning of
iteration $t$ is upper bounded by $Q\bigl(|S_{II}(t-1)| - L_{opt(S_{II}(t-1))}(S_{II}(t-1))\bigr)$.

At each iteration, the Zapping algorithm chooses the current active algorithm.
The goal behind this choice is to keep the above upper bounds on the local
regrets that $H_I$ and $H_{II}$ will have at the end of the game very close to each other. As
can be seen in the analysis in the next section, achieving this goal is enough to
obtain this paper’s new upper bound.

The rule that the Zapping algorithm uses in order to select the active algorithm
for the current iteration is as follows: the Zapping algorithm lets the algorithm that
had the smaller upper bound on the local regret so far be the active algorithm in
the current iteration. Our analysis shows that this local rule achieves the above
global goal of close local regrets at the end of the game. This is not immediate,
since the locally optimal action can change from one iteration to the next.

Theorem 4. In the Full Loss game, for any number of actions $K \ge 2$, any number
of iterations $T$, and any possible assignment of losses that is made by the
adversary, the regret of algorithm Zapping is upper bounded by

$$2 \cdot \min\bigl\{Q(L_{opt}),\, Q(T - L_{opt})\bigr\} + 1,$$

where $Q(X) = 5.7 \cdot \sqrt{\ln K \cdot X} + \ln K \cdot \log_2(X + 1) + 6.9 \ln K$.



Algorithm Zapping

Initialization: Initialize algorithms H_I and H_II.

For t = 1, 2, ... until game ends do steps 1 to 3:

1. Calculate the local optimal losses:

   optimal_loss(H_I, t − 1) = min_i Σ_{t' ∈ S_I(t−1)} l_i(t')

   optimal_loss(H_II, t − 1) = min_i Σ_{t' ∈ S_II(t−1)} l_i(t')

2. If optimal_loss(H_I, t − 1) < |S_II(t − 1)| − optimal_loss(H_II, t − 1),
   then let algorithm H_I be A_t, the active algorithm in iteration t.
   Otherwise, let algorithm H_II be A_t.

3. Apply algorithm A_t to the current iteration t, i.e., let A_t choose the
   action for the current iteration t, and then feed back the loss vector
   l(t) to A_t. Let A_t update its weights accordingly.



Fig. 3.
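
The selection rule of steps 1 to 3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class `Hedge` is a hypothetical fixed-rate exponential-weights learner standing in for the doubling versions $H_I$ and $H_{II}$, and the names `zapping`, `eta` and `seed` are assumptions made for the example.

```python
import math
import random


class Hedge:
    """Stand-in sub-learner: exponential weights with a fixed rate eta (assumption)."""

    def __init__(self, n_actions, eta=0.5):
        self.w = [1.0] * n_actions
        self.eta = eta

    def choose(self, rng):
        # Sample an action with probability proportional to its weight.
        r = rng.random() * sum(self.w)
        acc = 0.0
        for i, wi in enumerate(self.w):
            acc += wi
            if r <= acc:
                return i
        return len(self.w) - 1

    def update(self, losses):
        self.w = [wi * math.exp(-self.eta * l) for wi, l in zip(self.w, losses)]


def zapping(loss_vectors, n_actions, seed=0):
    """Return the sequence of active algorithms and the total incurred loss."""
    rng = random.Random(seed)
    learners = {"I": Hedge(n_actions), "II": Hedge(n_actions)}
    cum = {"I": [0.0] * n_actions, "II": [0.0] * n_actions}  # per-action loss on S_I, S_II
    count = {"I": 0, "II": 0}                                # |S_I(t-1)|, |S_II(t-1)|
    schedule, total_loss = [], 0.0
    for losses in loss_vectors:
        opt_i, opt_ii = min(cum["I"]), min(cum["II"])        # step 1
        active = "I" if opt_i < count["II"] - opt_ii else "II"   # step 2
        action = learners[active].choose(rng)                # step 3
        total_loss += losses[action]
        learners[active].update(losses)                      # only the active learner is updated
        cum[active] = [c + l for c, l in zip(cum[active], losses)]
        count[active] += 1
        schedule.append(active)
    return schedule, total_loss


zero_schedule, zero_total = zapping([[0.0, 0.0]] * 4, 2)
ones_schedule, ones_total = zapping([[1.0, 1.0]] * 5, 2)
```

With all losses equal to zero the strict comparison in step 2 sends the first iteration to the second learner and all later ones to the first; with all losses equal to one the second learner stays active throughout.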
6 Open Questions

The Partial Information Game (the adversarial multi-armed bandit problem) was
defined by Auer, Cesa-Bianchi, Freund and Schapire [3]. This game is similar
to the Full Information game except that at the end of the iteration
(step 3) only the payoff associated with the action that the online
algorithm chose becomes known.

In Auer et al. [3], an upper bound of $O(\sqrt{K \ln K \cdot G_{opt}})$ is shown on the regret
of the gain variant of the Partial Information game. In Allenberg-Neeman, Auer
and Cesa-Bianchi [2], an upper bound of $O(K \ln K \sqrt{L_{opt}})$ is shown for the loss
variant. It remains an open question whether the results of this paper can be
extended to the Partial Information Game.

Acknowledgement. We would like to thank Amos Fiat for suggesting that we work
on the gain-loss question, and for constructive suggestions. We would also like
to thank Nicolò Cesa-Bianchi, Yishay Mansour and Leah Epstein for helpful
discussions.
Appendix A



Proof of Theorem 2

Let $w_i(t)$ denote the weight of action $i$ at the beginning of iteration $t$.

As in the original Weighted Majority [11] and Hedge [8] techniques, we define
$W(t) = \sum_{i=1}^{K} w_i(t)$ and conduct the analysis by showing an upper and a lower
bound on $W(T+1)$.



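An Upper Bound on $W(T+1)$:

GL($\alpha$) receives $y_i(t)$ as the payoff of action $a_i$ in iteration $t$. The upper
bound is developed in the same way as in the proof of Theorem 1:

$$\ln \frac{W(t+1)}{W(t)} \le \alpha \sum_{i=1}^{K} p_i(t) \cdot y_i(t).$$

$X_{GL(\alpha)}(t)$ is the payoff of algorithm GL($\alpha$) in iteration $t$. It is always a gain,
since unlike the simulated values that GL($\alpha$) receives ($y_i(t)$), the real values of
the actions in this game are in $[0,1]$. Let us substitute the simulated values with
the real ones:

$$\ln \frac{W(t+1)}{W(t)} \le \alpha \sum_{i=1}^{K} p_i(t) \bigl(g_i(t) - X_{GGL(\alpha)}(t)\bigr),$$

where $p_i(t)$ is the probability with which algorithm GL($\alpha$) chooses action $a_i$
in iteration $t$. Since $\sum_{i=1}^{K} p_i(t)\, g_i(t) = X_{GGL(\alpha)}(t)$ and $\sum_{i=1}^{K} p_i(t) = 1$,

$$\ln \frac{W(t+1)}{W(t)} \le \alpha \cdot \bigl(X_{GGL(\alpha)}(t) - X_{GGL(\alpha)}(t)\bigr) = 0,$$

$$\ln \frac{W(T+1)}{W(1)} = \sum_{t=1}^{T} \ln \frac{W(t+1)}{W(t)} \le 0.$$
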



A Lower Bound on $W(T+1)$:

For any action $j \in \{1, 2, \ldots, K\}$,

$$W(T+1) \ge (1+\alpha)^{\sum_{t \in M} y_j(t)} \cdot (1-\alpha)^{\sum_{t \in L} |y_j(t)|} \cdot \frac{1}{K},$$

where $M$ is the set of iterations where $g_j(t) \ge X_{on}(t)$ and therefore $y_j(t) \ge 0$
(“More”), $L$ is the set of iterations where $g_j(t) < X_{on}(t)$ and therefore
$y_j(t) < 0$ (“Less”), and by the GL algorithm the loss in these iterations is $|y_j(t)|$.
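Taking $\ln$ and noticing that $W(1) = 1$ gives us:

$$\ln W(T+1) \ge \sum_{t \in M} y_j(t) \cdot \ln(1+\alpha) + \sum_{t \in L} |y_j(t)| \cdot \ln(1-\alpha) - \ln K.$$

Combining the above two bounds yields:

$$0 \ge \sum_{t \in M} y_j(t) \cdot \ln(1+\alpha) + \sum_{t \in L} |y_j(t)| \cdot \ln(1-\alpha) - \ln K.$$

Dividing by $\alpha$ and applying the approximation from Claim 1 yields:

$$0 \ge \sum_{t \in M} y_j(t) \cdot \frac{\ln(1+\alpha)}{\alpha} + \sum_{t \in L} |y_j(t)| \cdot \frac{\ln(1-\alpha)}{\alpha} - \frac{\ln K}{\alpha}$$

$$\ge \sum_{t \in M} y_j(t) \cdot \Bigl(1 - \frac{\alpha}{2}\Bigr) - \sum_{t \in L} |y_j(t)| \cdot \Bigl(1 + \frac{\alpha}{2(1-\alpha)}\Bigr) - \frac{\ln K}{\alpha}$$

$$\ge \sum_{t \in M} y_j(t) - \sum_{t \in L} |y_j(t)| - \frac{\alpha}{2(1-\alpha)} \Bigl(\sum_{t \in M} y_j(t) + \sum_{t \in L} |y_j(t)|\Bigr) - \frac{\ln K}{\alpha}.$$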
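
The step from the logarithms to the factors $1 - \alpha/2$ and $1 + \alpha/(2(1-\alpha))$ rests on two elementary lower bounds. Claim 1 itself is not reproduced in this section, so the two inequalities checked below are an assumption inferred from the factors that appear in the derivation; the function name `log_bounds_hold` is illustrative.

```python
import math


def log_bounds_hold(alpha):
    """Check the lower bounds on ln(1 + alpha)/alpha and ln(1 - alpha)/alpha."""
    gain_ok = math.log(1.0 + alpha) / alpha >= 1.0 - alpha / 2.0
    loss_ok = math.log(1.0 - alpha) / alpha >= -(1.0 + alpha / (2.0 * (1.0 - alpha)))
    return gain_ok and loss_ok


# Scan a grid of parameter values strictly inside (0, 1).
alphas = [k / 100.0 for k in range(1, 100)]
all_hold = all(log_bounds_hold(a) for a in alphas)
```

A grid scan is not a proof, but it is a quick sanity check that the signs and denominators in the reconstructed chain are consistent.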

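Let us substitute the simulated values with the real ones,
$y_j(t) = g_j(t) - X_{GGL(\alpha)}(t)$.

Note that $\sum_{t \in M} g_j(t) + \sum_{t \in L} g_j(t) = G_j$ (where $G_j$ is the total gain of action
$j$), and $\sum_{t \in M} X_{GGL(\alpha)}(t) + \sum_{t \in L} X_{GGL(\alpha)}(t) = X_{GGL(\alpha)}$.
Hence, $\sum_{t \in M} y_j(t) - \sum_{t \in L} |y_j(t)| = G_j - X_{GGL(\alpha)}$.

We obtain that for any action $a_j$,

$$X_{GGL(\alpha)} \ge G_j - \frac{\alpha}{2(1-\alpha)} \Bigl(\sum_{t \in M} y_j(t) + \sum_{t \in L} |y_j(t)|\Bigr) - \frac{\ln K}{\alpha}.$$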

Let us consider the case where action $a_j$ is OPT:

If $X_{GGL(\alpha)} \ge G_{OPT}$ then the theorem holds. Otherwise, $X_{GGL(\alpha)} < G_{OPT}$ and
therefore the summation of $y_{OPT}(t) = g_{OPT}(t) - X_{GGL(\alpha)}(t)$ over all the iterations
where $X_{GGL(\alpha)}(t) < g_{OPT}(t)$ is larger than the summation of
$|y_{OPT}(t)| = |g_{OPT}(t) - X_{on}(t)|$ over all the iterations where $X_{GGL(\alpha)}(t) > g_{OPT}(t)$.
We obtain

$$X_{GGL(\alpha)} \ge G_{OPT} - \frac{\alpha}{2(1-\alpha)} \cdot 2 \sum_{t \in M} y_{OPT}(t) - \frac{\ln K}{\alpha}.$$

$\sum_{t \in M} y_{OPT}(t)$ is the summation of the difference
$y_{OPT}(t) = g_{OPT}(t) - X_{GGL(\alpha)}(t)$ over all the iterations $t$ where
$g_{OPT}(t) > X_{GGL(\alpha)}(t)$. By definition, it measures the exceeding
value of OPT with respect to GGL($\alpha$), $D_{OPT}^{GGL(\alpha)}$. Hence,

$$X_{GGL(\alpha)} \ge G_{OPT} - \frac{2\alpha}{2(1-\alpha)}\, D_{OPT}^{GGL(\alpha)} - \frac{\ln K}{\alpha}.$$

Substituting $\alpha$ according to the theorem completes the proof.
Abstract. When applying aggregating strategies to Prediction with Expert
Advice, the learning rate must be adaptively tuned. The natural
choice of $\sqrt{\text{complexity}/\text{current loss}}$ renders the analysis of Weighted
Majority derivatives quite complicated. In particular, for arbitrary weights
there have been no results proven so far. The analysis of the alternative
“Follow the Perturbed Leader” (FPL) algorithm from [KV03] (based
on Hannan’s algorithm) is easier. We derive loss bounds for adaptive
learning rate and both finite expert classes with uniform weights and
countable expert classes with arbitrary weights. For the former setup,
our loss bounds match the best known results so far, while for the latter
our results are new.



1 Introduction 

The theory of Prediction with Expert Advice (PEA) has rapidly developed in the 
recent past. Starting with the Weighted Majority (WM) algorithm of Littlestone 
and Warmuth [LW89,LW94] and the aggregating strategy of Vovk [Vov90], a vast 
variety of different algorithms and variants have been published. A key parameter 
in all these algorithms is the learning rate. While this parameter had to be fixed in 
the early algorithms such as WM, [CB97] established the so-called doubling trick 
to make the learning rate coarsely adaptive. A little later, incrementally adaptive 
algorithms were developed [AG00,ACBG02,YEYS04,Gen03]. Unfortunately, the 
loss bound proofs for the incrementally adaptive WM variants are quite complex 
and technical, despite the typically simple and elegant proofs for a static learning 
rate. 

The increasingly complex proof techniques also had another consequence: While
for the original WM algorithm, assertions are proven for countable classes of 
experts with arbitrary weights, the modern variants usually restrict to finite 
classes with uniform weights (an exception being [Gen03], see the discussion 
section). This might be sufficient for many practical purposes but it prevents 
the application to more general classes of predictors. Examples are extrapolat- 
ing (=predicting) data points with the help of a polynomial (=expert) of degree 
d= 1,2,3,... -or- the (from a computational point of view largest) class of all com- 
putable predictors. Furthermore, most authors have concentrated on predicting 

* This work was supported by SNF grant 2100-67712.02. 
binary sequences, often with the 0/1 loss for {0,1}-valued and the absolute loss
for [0,1]-valued predictions. Arbitrary losses are less common. Nevertheless, it
is easy to abstract completely from the predictions and consider the resulting 
losses only. Instead of predicting according to a “weighted majority” in each 
time step, one chooses one single expert with a probability depending on his 
past cumulated loss. This is done e.g. in [FS97], where an elegant WM variant, 
the Hedge algorithm, is analyzed. 

A different, general approach to achieve similar results is “Follow the Per- 
turbed Leader” (FPL). The principle dates back to as early as 1957, now called 
Hannan’s algorithm [Han57]. In 2003, Kalai and Vempala published a simpler
proof of the main result of Hannan and also succeeded in improving the bound by
modifying the distribution of the perturbation [KV03]. The resulting algorithm
(which they call FPL*) has the same performance guarantees as the WM-type
algorithms for fixed learning rate, save for a factor of $\sqrt{2}$. A major advantage we
will discover in this work is that its analysis remains easy for an adaptive learn- 
ing rate, in contrast to the WM derivatives. Moreover, it generalizes to online 
decision problems other than PEA. 

In this work we study the FPL algorithm for PEA. The problems of WM 
algorithms mentioned above are addressed: We consider countable expert classes 
with arbitrary weights, adaptive learning rate, and arbitrary losses. Regarding 
the adaptive learning rate, we obtain proofs that are simpler and more elegant 
than for the corresponding WM algorithms. (In particular, the proof for a
self-confident choice of the learning rate, Theorem 7, is less than half a page.) Further,
we prove the first loss bounds for arbitrary weights and adaptive learning rate. 
Our result even seems to be the first for equal weights and arbitrary losses, 
however the proof technique from [ACBG02] is likely to carry over to this case. 

This paper is structured as follows. In Section 2 we give the basic definitions. 
Sections 3 and 4 derive the main analysis tools, following the lines of [KV03], but 
with some important extensions. They are applied in order to prove various upper 
bounds in Section 5. Section 6 proposes a hierarchical procedure to improve the 
bounds for non-uniform weights. Section 7 treats some additional issues. Finally, 
in Section 8 we discuss our results, compare them to references, and state some 
open problems. 

2 Setup and Notation 

Setup. Prediction with Expert Advice proceeds as follows. We are asked to
perform sequential predictions $y_t \in \mathcal{Y}$ at times $t = 1, 2, \ldots$. At each time step $t$, we
have access to the predictions $(y_t^i)_{1 \le i \le n}$ of $n$ experts $\{e_1, \ldots, e_n\}$. After having
made a prediction, we make some observation $x_t \in \mathcal{X}$, and a Loss is revealed for
our and each expert’s prediction. (E.g. the loss might be 1 if the expert made
an erroneous prediction and 0 otherwise. This is the 0/1-loss.) Our goal is to
achieve a total loss “not much worse” than the best expert, after $t$ time steps.

We admit $n \in \mathbb{N} \cup \{\infty\}$ experts, each of which is assigned a known complexity
$k^i \ge 0$. Usually we require $\sum_i e^{-k^i} \le 1$, for instance $k^i = \ln n$ if $n < \infty$ or $k^i = \frac{1}{2} + 2 \ln i$
if $n = \infty$. Each complexity defines a weight by means of $e^{-k^i}$ and vice versa. In the
following we will talk rather of complexities than of weights. If $n$ is finite, then
usually one sets $k^i = \ln n$ for all $i$; this is the case of uniform complexities/weights.
If the set of experts is countably infinite ($n = \infty$), uniform complexities are not
possible. The vector of all complexities is denoted by $k = (k^i)_{1 \le i \le n}$. At each time
$t$, each expert $i$ suffers a loss¹ $s_t^i = \mathrm{Loss}(x_t, y_t^i) \in [0,1]$, and $s_t = (s_t^i)_{1 \le i \le n}$ is the
vector of all losses at time $t$. Let $s_{<t} = s_1 + \ldots + s_{t-1}$ (respectively $s_{1:t} = s_1 + \ldots + s_t$)
be the total past loss vector (including current loss $s_t$) and $s_{1:t}^{\min} = \min_i \{s_{1:t}^i\}$ be
the loss of the best expert in hindsight (BEH). Usually we do not know in advance
the time $t \ge 0$ at which the performance of our predictions is evaluated.

General decision spaces. The setup can be generalized as follows. Let $\mathcal{S} \subset \mathbb{R}^n$
be the state space and $\mathcal{D} \subset \mathbb{R}^n$ the decision space. At time $t$ the state is $s_t \in \mathcal{S}$,
and a decision $d_t \in \mathcal{D}$ (which is made before the state is revealed) incurs a loss
$d_t \circ s_t$, where “$\circ$” denotes the inner product. This implies that the loss function
is linear in the states. Conversely, each linear loss function can be represented
in this way. The decision which minimizes the loss in state $s \in \mathcal{S}$ is

$$M(s) := \arg\min_{d \in \mathcal{D}} \{d \circ s\} \qquad (1)$$

if the minimum exists. The application of this general framework to PEA is
straightforward: $\mathcal{D}$ is identified with the space of all unit vectors $\mathcal{E} = \{e_i : 1 \le i \le n\}$,
since a decision consists of selecting a single expert, and $s_t \in [0,1]^n$, so states are
identified with losses. Only Theorem 2 will be stated in terms of general decision
space, where we require that all minima are attained.² Our main focus is $\mathcal{D} = \mathcal{E}$.
However, all our results generalize to the simplex $\mathcal{D} = \Delta = \{v \in [0,1]^n : \sum_i v^i = 1\}$,
since the minimum of a linear function on $\Delta$ is always attained on $\mathcal{E}$.

Follow the Perturbed Leader. Given $s_{<t}$ at time $t$, an immediate idea to
solve the expert problem is to “Follow the Leader” (FL), i.e. selecting the expert
$e_i$ which performed best in the past (minimizes $s_{<t}^i$), that is, predict according to
expert $M(s_{<t})$. This approach fails for two reasons. First, for $n = \infty$ the minimum
in (1) may not exist. Second, for $n = 2$ and $s = \binom{0\;1\;0\;1\;0\;1\,\ldots}{\frac{1}{2}\;0\;1\;0\;1\;0\,\ldots}$, FL always chooses the
wrong prediction [KV03]. We solve the first problem by penalizing each expert by
its complexity, i.e. predicting according to expert $M(s_{<t} + k)$. The FPL (Follow
the Perturbed Leader) approach solves the second problem by adding to each
expert’s loss $s_{<t}^i$ a random perturbation. We choose this perturbation to be
negative exponentially distributed, either independent in each time step or once
and for all at the very beginning at time $t = 0$. The former choice is preferable
in order to protect against an adaptive adversary who generates the $s_t$, and in
order to get bounds with high probability (Section 7). For the main analysis
however, the latter is more convenient. Due to linearity of expectations, these
two possibilities are equivalent when dealing with expected losses, so we can
henceforth assume without loss of generality one initial perturbation $q$.
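
The second failure mode of Follow the Leader can be reproduced numerically. The sketch below is an illustration under an assumption: it uses an alternating two-expert loss sequence in which the first expert loses 0, 1, 0, 1, ... and the second loses 1/2, 0, 1, 0, ..., which is the kind of sequence the argument from [KV03] relies on. The function name `follow_the_leader` and the tie-breaking rule (lowest index) are choices made for the example.

```python
def follow_the_leader(loss_rows):
    """Run FL on per-step loss vectors; ties go to the lowest index."""
    n = len(loss_rows[0])
    cumulative = [0.0] * n
    fl_loss = 0.0
    for losses in loss_rows:
        # The leader is chosen from past losses only, before seeing `losses`.
        leader = min(range(n), key=lambda i: cumulative[i])
        fl_loss += losses[leader]
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return fl_loss, min(cumulative)


T = 100
rows = [(0.0, 0.5)] + [(1.0, 0.0) if t % 2 == 0 else (0.0, 1.0) for t in range(2, T + 1)]
fl_loss, best_loss = follow_the_leader(rows)
```

From the second step on, FL picks the expert that is about to lose, so its loss grows like $T$ while the best expert in hindsight loses only about $T/2$.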

¹ The setup, analysis and results easily scale to $s_t^i \in [0, S]$ for $S > 0$ other than 1.

² Apparently, there is no natural condition on $\mathcal{D}$ and/or $\mathcal{S}$ which guarantees the
existence of all minima for $n = \infty$.
The FPL algorithm is defined as follows:

Choose random vector $q \overset{d.}{\sim} \exp$, i.e. $P[q^1 \ldots q^n] = e^{-q^1} \cdot \ldots \cdot e^{-q^n}$ for $q \ge 0$.

For $t = 1, \ldots, T$

- Choose learning rate $\varepsilon_t$.

- Output prediction of expert $i$ which minimizes $s_{<t}^i + (k^i - q^i)/\varepsilon_t$.

- Receive loss $s_t^i$ for all experts $i$.

Other than $s_{<t}$, $k$ and $q$, FPL depends on the learning rate $\varepsilon_t$. We will give
choices for $\varepsilon_t$ in Section 5, after having established the main tools for the analysis.
The expected loss at time $t$ of FPL is $\ell_t := E[M(s_{<t} + \frac{k-q}{\varepsilon_t}) \circ s_t]$. The key idea
in the FPL analysis is the use of an intermediate predictor IFPL (for Implicit
or Infeasible FPL). IFPL predicts according to $M(s_{1:t} + \frac{k-q}{\varepsilon_t})$, thus under the
knowledge of $s_t$ (which is of course not available in reality). By $r_t := E[M(s_{1:t} + \frac{k-q}{\varepsilon_t}) \circ s_t]$
we denote the expected loss of IFPL at time $t$. The losses of IFPL will
be upper bounded by BEH in Section 3 and lower bounded by FPL in Section 4.
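
The algorithm and its infeasible companion can be sketched together, since they differ only in whether the current loss is visible when the expert is selected. This is a minimal sketch under stated assumptions: a single initial perturbation is drawn, the learning rate is passed as a function of $t$, and the names `fpl_run`, `learning_rate` and `seed` are hypothetical.

```python
import random


def fpl_run(loss_rows, complexities, learning_rate, seed=0):
    """Run FPL and the infeasible IFPL on the same perturbation q.

    loss_rows: list of per-step loss vectors s_t with entries in [0, 1].
    complexities: list of k^i >= 0.
    learning_rate: function t -> eps_t > 0 (illustrative interface).
    Returns (FPL loss, IFPL loss, loss of the best expert in hindsight).
    """
    rng = random.Random(seed)
    n = len(complexities)
    q = [rng.expovariate(1.0) for _ in range(n)]   # one initial perturbation
    past = [0.0] * n                               # s_{<t}
    fpl_loss = ifpl_loss = 0.0
    for t, s_t in enumerate(loss_rows, start=1):
        eps = learning_rate(t)
        penalty = [(k - qi) / eps for k, qi in zip(complexities, q)]
        # FPL only sees s_{<t}; IFPL additionally sees the current loss s_t.
        i_fpl = min(range(n), key=lambda i: past[i] + penalty[i])
        i_ifpl = min(range(n), key=lambda i: past[i] + s_t[i] + penalty[i])
        fpl_loss += s_t[i_fpl]
        ifpl_loss += s_t[i_ifpl]
        past = [p + s for p, s in zip(past, s_t)]
    return fpl_loss, ifpl_loss, min(past)


# One expert never errs; a very large rate makes the perturbation negligible.
rows = [[0.0, 1.0, 1.0]] * 50
fpl, ifpl, best = fpl_run(rows, [0.0, 0.0, 0.0], lambda t: 1e9)
```

In this degenerate example IFPL never errs, while FPL can err at most once, in the first step, where only the perturbation breaks the tie.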

Notes. Observe that we have stated the FPL algorithm regardless of the actual 
predictions of the experts and possible observations, only the losses are relevant. 
Note also that an expert can implement a highly complicated strategy depending
on past outcomes, despite its trivializing identification with a constant unit
vector. The complex expert’s (and environment’s) behavior is summarized and
hidden in the state vector $s_t = \mathrm{Loss}(x_t, y_t^i)_{1 \le i \le n}$. Our results therefore apply to
arbitrary prediction and observation spaces $\mathcal{Y}$ and $\mathcal{X}$ and arbitrary bounded loss
functions. This is in contrast to the major part of PEA work developed for bi- 
nary alphabet and 0/1 or absolute loss only. Finally note that the setup allows 
for losses generated by an adversary who tries to maximize the regret of FPL 
and knows the FPL algorithm and all experts’ past predictions/losses. If the ad- 
versary also has access to FPL’s past decisions, then FPL must use independent 
randomization at each time step in order to achieve good regret bounds. 

3 IFPL Bounded by Best Expert in Hindsight 

In this section we provide tools for comparing the loss of IFPL to the loss of the 
best expert in hindsight. The first result bounds the expected error induced by 
the exponentially distributed perturbation. 

Lemma 1 (Maximum of Shifted Exponential Distributions). Let $q^1, \ldots, q^n$
be exponentially distributed random variables, i.e. $P[q^i] = e^{-q^i}$ for $q^i \ge 0$ and
$1 \le i \le n \le \infty$, and $k^i \in \mathbb{R}$ be real numbers with $u := \sum_{i=1}^n e^{-k^i}$. Then

$$E[\max_i \{q^i - k^i\}] \le 1 + \ln u.$$

Proof. Using $P[q^i \ge b] \le e^{-b}$ for $b \in \mathbb{R}$ we get

$$P[\max_i \{q^i - k^i\} \ge a] = P[\exists i : q^i - k^i \ge a] \le \sum_{i=1}^n P[q^i - k^i \ge a] \le \sum_{i=1}^n e^{-a - k^i} = u \cdot e^{-a},$$

where the first inequality is the union bound. Using
$E[z] \le E[\max\{0, z\}] = \int_0^\infty P[\max\{0, z\} \ge y]\,dy = \int_0^\infty P[z \ge y]\,dy$
(valid for any real-valued random variable $z$) for $z = \max_i \{q^i - k^i\} - \ln u$, this implies

$$E[\max_i \{q^i - k^i\} - \ln u] \le \int_0^\infty P[\max_i \{q^i - k^i\} \ge y + \ln u]\,dy \le \int_0^\infty e^{-y}\,dy = 1,$$

which proves the assertion. □

If $n$ is finite, a lower bound $E[\max_i q^i] \ge 0.57721 + \ln n$ can be derived, showing
that the upper bound on $E[\max]$ is quite tight (at least) for $k^i = 0$ $\forall i$. The
following bound generalizes [KV03, Lem. 3] to arbitrary weights.
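
For $k^i = 0$ the tightness claim can be checked exactly rather than by simulation, using the standard fact that the expected maximum of $n$ independent standard exponential variables equals the $n$-th harmonic number. The sketch below compares that value with both bounds; the function name `expected_max_exponential` and the chosen values of $n$ are illustrative.

```python
import math


def expected_max_exponential(n):
    """E[max of n i.i.d. Exp(1) variables], which equals the n-th harmonic number."""
    return sum(1.0 / j for j in range(1, n + 1))


EULER_GAMMA = 0.57721  # truncated Euler-Mascheroni constant, as quoted in the text

checks = []
for n in (1, 2, 10, 1000):
    h = expected_max_exponential(n)
    # With k^i = 0 we have u = n, so the upper bound of Lemma 1 is 1 + ln n.
    checks.append(EULER_GAMMA + math.log(n) <= h <= 1.0 + math.log(n))
```

The gap between the two bounds is less than 0.43 for every $n$, which is the sense in which the upper bound is tight for uniform complexities.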

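Theorem 2 (IFPL bounded by BEH). Let $\mathcal{D} \subset \mathbb{R}^n$, $s_t \in \mathbb{R}^n$ for $1 \le t \le T$
(both $\mathcal{D}$ and $s$ may even have negative components, but we assume that all
required extrema are attained), and $q, k \in \mathbb{R}^n$. If $\varepsilon_t > 0$ is decreasing in $t$, then
the loss of the infeasible FPL knowing $s_t$ at time $t$ in advance (l.h.s.) can be
bounded in terms of the best predictor in hindsight (first term on r.h.s.) plus
additive corrections:

$$\sum_{t=1}^T M\Bigl(s_{1:t} + \frac{k-q}{\varepsilon_t}\Bigr) \circ s_t \le \min_{d \in \mathcal{D}} \Bigl\{d \circ \Bigl(s_{1:T} + \frac{k}{\varepsilon_T}\Bigr)\Bigr\} + \frac{1}{\varepsilon_T} \max_{d \in \mathcal{D}} \{d \circ (q-k)\} - \frac{1}{\varepsilon_T} M\Bigl(s_{1:T} + \frac{k}{\varepsilon_T}\Bigr) \circ q.$$
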

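Proof. For notational convenience, let $\varepsilon_0 = \infty$ and $\tilde{s}_{1:t} = s_{1:t} + \frac{k-q}{\varepsilon_t}$. Consider the
losses $\tilde{s}_t = s_t + (k-q)\bigl(\frac{1}{\varepsilon_t} - \frac{1}{\varepsilon_{t-1}}\bigr)$ for the moment. We first show by induction on
$T$ that the infeasible predictor $M(\tilde{s}_{1:t})$ has zero regret, i.e.

$$\sum_{t=1}^T M(\tilde{s}_{1:t}) \circ \tilde{s}_t \le M(\tilde{s}_{1:T}) \circ \tilde{s}_{1:T}. \qquad (2)$$

For $T = 1$ this is obvious. For the induction step from $T-1$ to $T$ we need to show

$$M(\tilde{s}_{1:T}) \circ \tilde{s}_T \le M(\tilde{s}_{1:T}) \circ \tilde{s}_{1:T} - M(\tilde{s}_{<T}) \circ \tilde{s}_{<T}.$$

This follows from $\tilde{s}_{1:T} = \tilde{s}_{<T} + \tilde{s}_T$ and $M(\tilde{s}_{1:T}) \circ \tilde{s}_{<T} \ge M(\tilde{s}_{<T}) \circ \tilde{s}_{<T}$ by minimality
of $M$. Rearranging terms in (2), we obtain

$$\sum_{t=1}^T M(\tilde{s}_{1:t}) \circ s_t \le M(\tilde{s}_{1:T}) \circ \tilde{s}_{1:T} - \sum_{t=1}^T M(\tilde{s}_{1:t}) \circ (k-q) \Bigl(\frac{1}{\varepsilon_t} - \frac{1}{\varepsilon_{t-1}}\Bigr). \qquad (3)$$

Moreover, by minimality of $M$,

$$M(\tilde{s}_{1:T}) \circ \tilde{s}_{1:T} \le M\Bigl(s_{1:T} + \frac{k}{\varepsilon_T}\Bigr) \circ \Bigl(s_{1:T} + \frac{k-q}{\varepsilon_T}\Bigr) = \min_{d \in \mathcal{D}} \Bigl\{d \circ \Bigl(s_{1:T} + \frac{k}{\varepsilon_T}\Bigr)\Bigr\} - M\Bigl(s_{1:T} + \frac{k}{\varepsilon_T}\Bigr) \circ \frac{q}{\varepsilon_T} \qquad (4)$$

holds. Using $\frac{1}{\varepsilon_t} - \frac{1}{\varepsilon_{t-1}} \ge 0$ and again minimality of $M$, we have

$$\sum_{t=1}^T \Bigl(\frac{1}{\varepsilon_t} - \frac{1}{\varepsilon_{t-1}}\Bigr) M(\tilde{s}_{1:t}) \circ (q-k) \le \sum_{t=1}^T \Bigl(\frac{1}{\varepsilon_t} - \frac{1}{\varepsilon_{t-1}}\Bigr) M(k-q) \circ (q-k) = \frac{1}{\varepsilon_T} M(k-q) \circ (q-k) = \frac{1}{\varepsilon_T} \max_{d \in \mathcal{D}} \{d \circ (q-k)\}. \qquad (5)$$

Inserting (4) and (5) back into (3) we obtain the assertion. □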
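
Because the bound of Theorem 2 holds for every fixed perturbation $q$, it can be checked on random instances without taking expectations. The sketch below does this for the expert setting, where decisions are unit vectors and the inner product with a unit vector picks out one component. The function name `theorem2_gap`, the instance sizes and the random seed are assumptions made for the example.

```python
import math
import random


def theorem2_gap(loss_rows, k, q, eps):
    """Right-hand side minus left-hand side of the bound, for unit-vector decisions.

    eps[t-1] is the learning rate at time t; it must be positive and non-increasing.
    """
    n, T = len(k), len(loss_rows)
    cum = [0.0] * n
    lhs = 0.0
    for t, s_t in enumerate(loss_rows):
        cum = [c + s for c, s in zip(cum, s_t)]                  # s_{1:t}
        i_star = min(range(n), key=lambda i: cum[i] + (k[i] - q[i]) / eps[t])
        lhs += s_t[i_star]                                       # IFPL loss at time t
    e_T = eps[T - 1]
    penalized = [cum[i] + k[i] / e_T for i in range(n)]
    j_star = min(range(n), key=lambda i: penalized[i])           # M(s_{1:T} + k/eps_T)
    rhs = penalized[j_star] + max(q[i] - k[i] for i in range(n)) / e_T - q[j_star] / e_T
    return rhs - lhs


rng = random.Random(1)
gaps = []
for _ in range(200):
    n, T = 5, 30
    rows = [[rng.random() for _ in range(n)] for _ in range(T)]
    k = [rng.uniform(0.0, 3.0) for _ in range(n)]
    q = [rng.expovariate(1.0) for _ in range(n)]
    eps = [1.0 / math.sqrt(t) for t in range(1, T + 1)]
    gaps.append(theorem2_gap(rows, k, q, eps))
```

Every gap should be non-negative up to rounding; for a single expert and a single step the two sides coincide.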



Assuming $q$ random with $E[q^i] = 1$ and taking the expectation in Theorem 2,
the last term reduces to $-\frac{1}{\varepsilon_T} \sum_{i=1}^n M(s_{1:T} + \frac{k}{\varepsilon_T})^i$. If $\mathcal{D} \ge 0$, the term is negative
and may be dropped. In case of $\mathcal{D} = \mathcal{E}$ or $\Delta$, the last term is identical to
$-\frac{1}{\varepsilon_T}$ (since $\sum_i d^i = 1$) and keeping it improves the bound. Furthermore, we need
to evaluate the expectation of the second to last term in Theorem 2, namely
$E[\max_{d \in \mathcal{D}} \{d \circ (q-k)\}]$. For $\mathcal{D} = \mathcal{E}$ and $q$ being exponentially distributed, using
Lemma 1, the expectation is bounded by $1 + \ln u$. We hence get the following
bound:



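Corollary 3 (IFPL bounded by BEH). For $\mathcal{D} = \mathcal{E}$ and $\sum_i e^{-k^i} \le 1$ and
$P[q^i] = e^{-q^i}$ for $q \ge 0$ and decreasing $\varepsilon_t > 0$, the expected loss of the infeasible
FPL exceeds the loss of expert $i$ by at most $k^i/\varepsilon_T$:

$$r_{1:T} \le s_{1:T}^i + \frac{1}{\varepsilon_T} k^i \quad \forall i.$$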



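Theorem 2 can be generalized to expert dependent factorizable $\varepsilon_t \leadsto \varepsilon_t^i = \varepsilon_t \cdot \varepsilon^i$
by scaling $k^i \leadsto k^i/\varepsilon^i$ and $q^i \leadsto q^i/\varepsilon^i$. Using $E[\max_i \{\frac{q^i - k^i}{\varepsilon^i}\}] \le E[\max_i \{q^i - k^i\}] / \min_i \{\varepsilon^i\}$,
Corollary 3 generalizes to

$$E\Bigl[\sum_{t=1}^T M\Bigl(s_{1:t} + \frac{k-q}{\varepsilon_t^i}\Bigr) \circ s_t\Bigr] \le s_{1:T}^i + \frac{1}{\varepsilon_T^i} k^i + \frac{1}{\varepsilon_T^{\min}} \quad \forall i,$$

where $\varepsilon_T^{\min} := \min_i \{\varepsilon_T^i\}$. For example, for $\varepsilon_t^i = \sqrt{k^i/t}$, additionally assuming
$k^i \ge 1$ $\forall i$, we get the desired bound $s_{1:T}^i + \sqrt{T} \cdot (\sqrt{k^i} + 1)$. Unfortunately we were
not able to generalize Theorem 4 to expert-dependent $\varepsilon$, necessary for the final
bound on FPL. In Section 6 we solve this problem by a hierarchy of experts.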



4 Feasible FPL Bounded by Infeasible FPL 

This section establishes the relation between the FPL and IFPL losses. Recall
that $\ell_t = E[M(s_{<t} + \frac{k-q}{\varepsilon_t}) \circ s_t]$ is the expected loss of FPL at time $t$ and
$r_t = E[M(s_{1:t} + \frac{k-q}{\varepsilon_t}) \circ s_t]$ is the expected loss of IFPL at time $t$.
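Theorem 4 (FPL bounded by IFPL). For $\mathcal{D} = \mathcal{E}$ and $0 \le s_t^i \le 1$ $\forall i$ and
arbitrary $s_{<t}$ and $P[q] = e^{-\sum_i q^i}$ for $q \ge 0$, the expected loss of the feasible FPL
is at most a factor $e^{\varepsilon_t} > 1$ larger than for the infeasible FPL:

$$\ell_t \le e^{\varepsilon_t} r_t, \quad \text{which implies} \quad \ell_{1:T} - r_{1:T} \le \sum_{t=1}^T \varepsilon_t \ell_t.$$

Furthermore, if $\varepsilon_t \le 1$, then also $\ell_t \le (1 + \varepsilon_t + \varepsilon_t^2)\, r_t \le (1 + 2\varepsilon_t)\, r_t$.

Proof. Let $s = s_{<t} + \frac{1}{\varepsilon} k$ be the past cumulative penalized state vector, $q$ be a
vector of independent exponential distributions, i.e. $P[q^i] = e^{-q^i}$, and $\varepsilon = \varepsilon_t$. We
now define the random variables $I := \arg\min_i \{s^i - \frac{1}{\varepsilon} q^i\}$ and $J := \arg\min_i \{s^i + s_t^i - \frac{1}{\varepsilon} q^i\}$,
where $0 \le s_t^i \le 1$ $\forall i$. Furthermore, for fixed vector $x \in \mathbb{R}^n$ and fixed $j$
we define $m := \min_{i \ne j} \{s^i - \frac{1}{\varepsilon} x^i\} \le \min_{i \ne j} \{s^i + s_t^i - \frac{1}{\varepsilon} x^i\} =: m'$. With this notation
and using the independence of $q^j$ from $q^i$ for all $i \ne j$, we get

$$P[I = j \mid q^i = x^i\ \forall i \ne j] = P[s^j - \tfrac{1}{\varepsilon} q^j \le m \mid q^i = x^i\ \forall i \ne j] = P[q^j \ge \varepsilon (s^j - m)]$$

$$\le e^{\varepsilon} P[q^j \ge \varepsilon (s^j - m + 1)] \le e^{\varepsilon} P[q^j \ge \varepsilon (s^j + s_t^j - m')]$$

$$= e^{\varepsilon} P[s^j + s_t^j - \tfrac{1}{\varepsilon} q^j \le m' \mid q^i = x^i\ \forall i \ne j] = e^{\varepsilon} P[J = j \mid q^i = x^i\ \forall i \ne j],$$

where we have used $P[q^j \ge a] \le e^{\varepsilon} P[q^j \ge a + \varepsilon]$. Since this bound holds under any
condition $x$, it also holds unconditionally, i.e. $P[I = j] \le e^{\varepsilon} P[J = j]$. For $\mathcal{D} = \mathcal{E}$ we
have $s_t^I = M(s_{<t} + \frac{k-q}{\varepsilon}) \circ s_t$ and $s_t^J = M(s_{1:t} + \frac{k-q}{\varepsilon}) \circ s_t$, which implies

$$\ell_t = E[s_t^I] = \sum_{j=1}^n s_t^j P[I = j] \le e^{\varepsilon} \sum_{j=1}^n s_t^j P[J = j] = e^{\varepsilon} E[s_t^J] = e^{\varepsilon} r_t.$$

Finally, $\ell_t - r_t \le \varepsilon_t \ell_t$ follows from $r_t \ge e^{-\varepsilon_t} \ell_t \ge (1 - \varepsilon_t) \ell_t$, and
$\ell_t \le e^{\varepsilon_t} r_t \le (1 + \varepsilon_t + \varepsilon_t^2)\, r_t \le (1 + 2\varepsilon_t)\, r_t$ for $\varepsilon_t \le 1$ is elementary. □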
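
Two elementary facts carry this proof: shifting the threshold of an exponential tail by the learning rate costs at most a factor $e^{\varepsilon}$, and $e^{\varepsilon}$ is dominated by a quadratic for rates up to 1. The sketch below checks both on a grid; the function names, the grid and the tiny relative tolerance are illustrative choices.

```python
import math


def tail(a):
    """P[q >= a] for a standard exponential variable q."""
    return math.exp(-max(a, 0.0))


def tail_shift_ok(a, eps):
    # Step used in the proof: P[q >= a] <= e^eps * P[q >= a + eps].
    # The factor (1 + 1e-12) absorbs rounding in the equality case a >= 0.
    return tail(a) <= math.exp(eps) * tail(a + eps) * (1.0 + 1e-12)


def exp_bounds_ok(eps):
    # Elementary chain e^eps <= 1 + eps + eps^2 <= 1 + 2*eps for 0 <= eps <= 1.
    return math.exp(eps) <= 1.0 + eps + eps * eps <= 1.0 + 2.0 * eps


grid = [x / 10.0 for x in range(-30, 31)]
rates = [x / 20.0 for x in range(0, 21)]
tail_checks = all(tail_shift_ok(a, e) for a in grid for e in rates)
exp_checks = all(exp_bounds_ok(e) for e in rates)
```

For non-negative thresholds the tail inequality holds with equality, which is why the factor $e^{\varepsilon_t}$ in the theorem cannot be improved by this argument.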

Remark. As in [KV03], one can prove a similar statement for general decision
space $\mathcal{D}$ as long as $\sum_i |s_t^i| \le A$ is guaranteed for some $A > 0$: In this case, we have
$\ell_t \le e^{\varepsilon_t A} r_t$. If $n$ is finite, then the bound holds for $A = n$. For $n = \infty$, the assertion
holds under the somewhat unnatural assumption that $\mathcal{S}$ is $\ell^1$-bounded.

5 Combination of Bounds and Choices for $\varepsilon_t$

Throughout this section, we assume

$$\mathcal{D} = \mathcal{E}, \quad s_t \in [0,1]^n \ \forall t, \quad P[q] = e^{-\sum_i q^i} \text{ for } q \ge 0, \quad \text{and} \quad \sum_i e^{-k^i} \le 1. \qquad (6)$$

We distinguish static and dynamic bounds. Static bounds refer to a constant
$\varepsilon_t \equiv \varepsilon$. Since this value has to be chosen in advance, a static choice of $\varepsilon_t$ requires
certain prior information and therefore is not practical in many cases. However,
the static bounds are very easy to derive, and they provide a good means to
compare different PEA algorithms. If on the other hand the algorithm shall be
applied without appropriate prior knowledge, a dynamic choice of $\varepsilon_t$ depending
only on $t$ and/or past observations is necessary.

Theorem 5 (FPL bound for static $\varepsilon_t = \varepsilon \propto 1/\sqrt{L}$). Assume (6) holds, then
the expected loss $\ell_t$ of feasible FPL, which employs the prediction of the expert $i$
minimizing $s_{<t}^i + \frac{k^i - q^i}{\varepsilon_t}$, is bounded by the loss of the best expert in hindsight in
the following way:

i) For $\varepsilon_t = \varepsilon = 1/\sqrt{L}$ with $L \ge \ell_{1:T}$ we have
$\ell_{1:T} \le s_{1:T}^i + \sqrt{L}\,(k^i + 1)$ $\forall i$

ii) For $\varepsilon_t = \sqrt{K/L}$ with $L \ge \ell_{1:T}$ and $k^i \le K$ $\forall i$ we have
$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{LK}$ $\forall i$

iii) For $\varepsilon_t = \sqrt{k^i/L}$ with $L \ge \max\{s_{1:T}^i, k^i\}$ we have
$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{L k^i} + 3k^i$

Note that according to assertion (iii), knowledge of only the ratio of the
complexity and the loss of the best expert is sufficient in order to obtain good
static bounds, even for non-uniform complexities.



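Proof. (i, ii) For $\varepsilon_t = \sqrt{K/L}$ and $L \ge \ell_{1:T}$, from Theorem 4 and Corollary 3, we
get

$$\ell_{1:T} - r_{1:T} \le \sum_{t=1}^T \varepsilon_t \ell_t = \ell_{1:T} \sqrt{K/L} \le \sqrt{LK} \quad \text{and} \quad r_{1:T} - s_{1:T}^i \le k^i/\varepsilon_T = k^i \sqrt{L/K}.$$

Combining both, we get $\ell_{1:T} - s_{1:T}^i \le \sqrt{L}\,(\sqrt{K} + k^i/\sqrt{K})$. (i) follows from $K = 1$
and (ii) from $k^i \le K$.

(iii) For $\varepsilon = \sqrt{k^i/L} \le 1$ we get

$$\ell_{1:T} \le e^{\varepsilon} r_{1:T} \le (1 + \varepsilon + \varepsilon^2)\, r_{1:T} \le \Bigl(1 + \sqrt{\tfrac{k^i}{L}} + \tfrac{k^i}{L}\Bigr)\Bigl(s_{1:T}^i + \sqrt{\tfrac{L}{k^i}}\, k^i\Bigr)$$

$$\le s_{1:T}^i + \sqrt{L k^i} + \Bigl(\sqrt{\tfrac{k^i}{L}} + \tfrac{k^i}{L}\Bigr)\bigl(L + \sqrt{L k^i}\bigr) = s_{1:T}^i + 2\sqrt{L k^i} + \Bigl(2 + \sqrt{\tfrac{k^i}{L}}\Bigr) k^i.$$

□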



The static bounds require knowledge of an upper bound $L$ on the loss (or the
ratio of the complexity of the best expert and its loss). Since the instantaneous
loss is bounded by 1, one may set $L = T$ if $T$ is known in advance. For finite $n$
and $k^i = K = \ln n$, bound (ii) gives the classic regret $\propto \sqrt{T \ln n}$. If neither $T$ nor
$L$ is known, a dynamic choice of $\varepsilon_t$ is necessary. We first present bounds with
regret $\propto \sqrt{T}$, thereafter with regret $\propto \sqrt{s_{1:T}^i}$.

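Theorem 6 (FPL bound for dynamic $\varepsilon_t \propto 1/\sqrt{t}$). Assume (6) holds.

i) For $\varepsilon_t = 1/\sqrt{t}$ we have $\ell_{1:T} \le s_{1:T}^i + \sqrt{T}\,(k^i + 2)$ $\forall i$

ii) For $\varepsilon_t = \sqrt{K/2t}$ and $k^i \le K$ $\forall i$ we have $\ell_{1:T} \le s_{1:T}^i + 2\sqrt{2TK}$ $\forall i$

Proof. For $\varepsilon_t = \sqrt{K/2t}$, using $\sum_{t=1}^T \frac{1}{\sqrt{t}} \le 2\sqrt{T}$ and $\ell_t \le 1$ we get

$$\ell_{1:T} - r_{1:T} \le \sum_{t=1}^T \varepsilon_t \le \sqrt{2TK} \quad \text{and} \quad r_{1:T} - s_{1:T}^i \le k^i/\varepsilon_T = k^i \sqrt{\frac{2T}{K}}.$$

Combining both, we get $\ell_{1:T} - s_{1:T}^i \le \sqrt{2T}\,(\sqrt{K} + k^i/\sqrt{K})$. (i) follows from $K = 2$
and (ii) from $k^i \le K$. □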
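
The only analytic ingredient of this proof is that the dynamic rates sum to at most $\sqrt{2TK}$. A short numerical check of that step, together with the resulting regret bound, might look as follows; the function names and the example value $K = \ln 8$ are assumptions made for illustration.

```python
import math


def sum_of_dynamic_rates(K, T):
    """Sum of eps_t = sqrt(K / (2 t)) for t = 1..T."""
    return sum(math.sqrt(K / (2.0 * t)) for t in range(1, T + 1))


def theorem6_regret_bound(K, T):
    """Regret bound 2 * sqrt(2 T K) of Theorem 6(ii)."""
    return 2.0 * math.sqrt(2.0 * T * K)


K = math.log(8)  # uniform complexity ln n for n = 8 experts (example value)
rate_sum_ok = all(
    sum_of_dynamic_rates(K, T) <= math.sqrt(2.0 * T * K) for T in (1, 10, 1000)
)
```

The bound grows like $\sqrt{T}$, so the average regret per time step vanishes as $T$ grows.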



In Theorem 5 we assumed knowledge of an upper bound $L$ on $\ell_{1:T}$. In an
adaptive form, $L_t := \ell_{<t} + 1$, known at the beginning of time $t$, could be used as
an upper bound on $\ell_{1:t}$ with corresponding adaptive $\varepsilon_t \propto 1/\sqrt{L_t}$. Such a choice of
$\varepsilon_t$ is also called self-confident [ACBG02].

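Theorem 7 (FPL bound for self-confident $\varepsilon_t \propto 1/\sqrt{\ell_{<t}}$). Assume (6) holds.

i) For $\varepsilon_t = 1/\sqrt{2(\ell_{<t} + 1)}$ we have
$\ell_{1:T} \le s_{1:T}^i + (k^i + 1)\sqrt{2(s_{1:T}^i + 1)} + 2(k^i + 1)^2$ $\forall i$

ii) For $\varepsilon_t = \sqrt{K/2(\ell_{<t} + 1)}$ and $k^i \le K$ $\forall i$ we have
$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{2(s_{1:T}^i + 1)K} + 8K$ $\forall i$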


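Proof. Using $\varepsilon_t = \sqrt{K/2(\ell_{<t} + 1)} \le \sqrt{K/2\ell_{1:t}}$ and $\frac{b-a}{\sqrt{b}} = (\sqrt{b} - \sqrt{a})(\sqrt{b} + \sqrt{a})\frac{1}{\sqrt{b}} \le 2(\sqrt{b} - \sqrt{a})$
for $a \le b$ and $t_0 := \min\{t : \ell_{1:t} > 0\}$ we get

$$\ell_{1:T} - r_{1:T} \le \sum_{t=t_0}^T \varepsilon_t \ell_t \le \sqrt{\frac{K}{2}} \sum_{t=t_0}^T \frac{\ell_{1:t} - \ell_{<t}}{\sqrt{\ell_{1:t}}} \le \sqrt{2K} \sum_{t=t_0}^T \bigl[\sqrt{\ell_{1:t}} - \sqrt{\ell_{<t}}\bigr] = \sqrt{2K}\, \sqrt{\ell_{1:T}}.$$

Adding $r_{1:T} - s_{1:T}^i \le \frac{k^i}{\varepsilon_T} \le k^i \sqrt{2(\ell_{1:T} + 1)/K}$ we get

$$\ell_{1:T} - s_{1:T}^i \le \sqrt{2\bar\kappa^i (\ell_{1:T} + 1)}, \quad \text{where} \quad \sqrt{\bar\kappa^i} := \sqrt{K} + k^i/\sqrt{K}.$$

Taking the square and solving the resulting quadratic inequality w.r.t. $\ell_{1:T}$ we
get

$$\ell_{1:T} \le s_{1:T}^i + \bar\kappa^i + \sqrt{2(s_{1:T}^i + 1)\bar\kappa^i + (\bar\kappa^i)^2} \le s_{1:T}^i + \sqrt{2(s_{1:T}^i + 1)\bar\kappa^i} + 2\bar\kappa^i.$$

For $K = 1$ we get $\sqrt{\bar\kappa^i} = k^i + 1$ which yields (i). For $k^i \le K$ we get $\bar\kappa^i \le 4K$ which
yields (ii). □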
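
The three families of learning rates discussed in this section differ only in what they need to know in advance. As a compact summary, they can be written as plain functions; the function names are illustrative, and `past_expected_loss` stands for the quantity $\ell_{<t}$, which in practice has to be computed or estimated.

```python
import math


def eps_static(K, L):
    """Static rate sqrt(K / L); needs an upper bound L on the loss (Theorem 5)."""
    return math.sqrt(K / L)


def eps_dynamic(K, t):
    """Dynamic rate sqrt(K / (2 t)); needs only the time index t (Theorem 6)."""
    return math.sqrt(K / (2.0 * t))


def eps_self_confident(K, past_expected_loss):
    """Self-confident rate sqrt(K / (2 (l_{<t} + 1))) driven by FPL's own past loss (Theorem 7)."""
    return math.sqrt(K / (2.0 * (past_expected_loss + 1.0)))


# Since the past loss is at most t - 1, the self-confident rate is never
# smaller than the dynamic rate at the same time step.
K, t = 3.0, 10
worst_case_match = abs(eps_self_confident(K, t - 1) - eps_dynamic(K, t)) < 1e-12
```

When the best expert has small loss, the self-confident rate stays large, which is what turns the $\sqrt{T}$ regret into a regret of order $\sqrt{s_{1:T}^i}$.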

The proofs of results similar to (ii) for WM for 0/1 loss all fill several pages
[ACBG02,YEYS04]. The next result establishes a similar bound, but instead of
using the expected value $\ell_{<t}$, the best loss so far $s_{<t}^{\min}$ is used. This may have
computational advantages, since $s_{<t}^{\min}$ is immediately available, while $\ell_{<t}$ needs
to be evaluated (see discussion in Section 7).
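Theorem 8 (FPL bound for adaptive $\varepsilon_t \propto 1/\sqrt{s_{<t}^{\min}}$). Assume (6) holds.

i) For $\varepsilon_t = 1/\min_i \{k^i + \sqrt{(k^i)^2 + 2 s_{<t}^i + 2}\}$ we have
$\ell_{1:T} \le s_{1:T}^i + (k^i + 2)\sqrt{2 s_{1:T}^i} + 2(k^i + 2)^2$ $\forall i$

ii) For $\varepsilon_t = \sqrt{\tfrac{1}{2}} \cdot \min\{1, \sqrt{K/s_{<t}^{\min}}\}$ and $k^i \le K$ $\forall i$ we have
$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{2K s_{1:T}^i} + 5K \ln(s_{1:T}^i) + 3K + 6$ $\forall i$

We briefly motivate the strange-looking choice for $\varepsilon_t$ in (i). The first naive
candidate, $\varepsilon_t \propto 1/\sqrt{s_{<t}^{\min}}$, turns out too large. The next natural trial is requesting
$\varepsilon_t = 1/\sqrt{2 \min_i \{s_{<t}^i + k^i/\varepsilon_t\}}$. Solving this equation results in $\varepsilon_t = 1/(k^i + \sqrt{(k^i)^2 + 2 s_{<t}^i})$,
where $i$ is the index for which $s_{<t}^i + k^i/\varepsilon_t$ is minimal.

Proof. Similar to the proof of the previous theorem, but more technical. □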

The bound (i) is a complete square, and so are the bounds of Theorem 7 when
adding 1 to them. Hence the bounds can be written as $\sqrt{\ell_{1:T}} \le \sqrt{s_{1:T}^i} + \sqrt{2}\,(k^i + 2)$
and $\sqrt{\ell_{1:T} + 1} \le \sqrt{s_{1:T}^i + 1} + \sqrt{8K}$ and $\sqrt{\ell_{1:T} + 1} \le \sqrt{s_{1:T}^i + 1} + \sqrt{2}\,(k^i + 1)$, respectively,
hence the $\sqrt{\text{Loss}}$-regrets are bounded for $T \to \infty$.

Remark. The same analysis as for Theorems [5-8](ii) applies to general $\mathcal{D}$, using
$\ell_t \le e^{\varepsilon_t n} r_t$ instead of $\ell_t \le e^{\varepsilon_t} r_t$, and leading to an additional factor $\sqrt{n}$ in the
regret. Compare the remark at the end of Section 4.



6 Hierarchy of Experts 

We derived bounds which do not need prior knowledge of L with regret $\propto \sqrt{TK}$
and $\propto \sqrt{s^i_{1:T}K}$ for a finite number of experts with equal penalty $K = k^i = \ln n$. For
an infinite number of experts, unbounded expert-dependent complexity penalties
$k^i$ are necessary (due to constraint $\sum_i e^{-k^i} \le 1$). Bounds for this case (without
prior knowledge of T) with regret $\propto k^i\sqrt{T}$ and $\propto k^i\sqrt{s^i_{1:T}}$ have been derived.
In this case, the complexity $k^i$ is no longer under the square root. It is likely
that improved regret bounds $\propto \sqrt{Tk^i}$ and $\propto \sqrt{s^i_{1:T}k^i}$ as in the finite case hold.

We were not able to derive such improved bounds for FPL, but for a (slight)
modification. We consider a two-level hierarchy of experts. First consider an
FPL for the subclass of experts of complexity $K$, for each $K \in \mathbb{N}$. Regard these
$\mathrm{FPL}^K$ as (meta) experts and use them to form a (meta) $\widetilde{\mathrm{FPL}}$. The class of meta
experts now contains for each complexity only one (meta) expert, which allows
us to derive good bounds. In the following, quantities referring to complexity
class $K$ are superscripted by $K$, and meta quantities are superscripted by ~.

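Consider the class of experts $\mathcal{E}^K := \{i : K-1 < k^i \le K\}$ of complexity $K$, for
each $K \in \mathbb{N}$. $\mathrm{FPL}^K$ makes randomized prediction $I_t^K := \arg\min_{i\in\mathcal{E}^K}\{s^i_{<t} + \frac{k^i - q^i}{\varepsilon_t^K}\}$
with $\varepsilon_t^K := \sqrt{K/2t}$ and suffers loss $u_t^K := s_t^{I_t^K}$ at time $t$. Since $k^i \le K\ \forall i \in \mathcal{E}^K$ we
can apply Theorem 6(ii) to $\mathrm{FPL}^K$:

$$E[u^K_{1:T}] = \ell^K_{1:T} \le s^i_{1:T} + 2\sqrt{2TK} \quad \forall i \in \mathcal{E}^K\ \forall K \in \mathbb{N}. \qquad (7)$$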

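We now define a meta state $\tilde{s}_t^K = u_t^K$ and regard $\mathrm{FPL}^K$ for $K \in \mathbb{N}$ as meta experts,
so meta expert $K$ suffers loss $\tilde{s}_t^K$. (Assigning expected loss $\tilde{s}_t^K = E[u_t^K] = \ell_t^K$ to
$\mathrm{FPL}^K$ would also work.) Hence the setting is again an expert setting and we
define the meta $\widetilde{\mathrm{FPL}}$ to predict $\tilde{I}_t := \arg\min_{K\in\mathbb{N}}\{\tilde{s}^K_{<t} + \frac{\tilde{k}^K - \tilde{q}^K}{\tilde{\varepsilon}_t}\}$ with $\tilde{\varepsilon}_t = 1/\sqrt{t}$
and $\tilde{k}^K = \frac{1}{2} + 2\ln K$ (implying $\sum_{K=1}^{\infty} e^{-\tilde{k}^K} \le 1$). Note that
$\tilde{s}^K_{1:T} = \tilde{s}^K_1 + \dots + \tilde{s}^K_T = s_1^{I_1^K} + \dots + s_T^{I_T^K}$
sums over the same meta state components $K$, but over different
components $I_t^K$ in normal state representation.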
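
The admissibility of the meta complexities can be checked in one line. The following worked step is a sketch that only uses the definition $\tilde{k}^K = \frac{1}{2} + 2\ln K$ from the text and the standard value $\sum_K K^{-2} = \pi^2/6$.

```latex
\[
\sum_{K=1}^{\infty} e^{-\tilde{k}^K}
  = e^{-1/2} \sum_{K=1}^{\infty} K^{-2}
  = e^{-1/2}\,\frac{\pi^2}{6}
  \approx 0.9977 \le 1 .
\]
```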

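By Theorem 6(i) the $\tilde{q}$-expected loss of $\widetilde{\mathrm{FPL}}$ is bounded by $\tilde{s}^K_{1:T} + \sqrt{T}(\tilde{k}^K + 2)$.
As this bound holds for all $q$ it also holds in $q$-expectation. So if we define $\tilde{\ell}_{1:T}$
to be the $q$ and $\tilde{q}$ expected loss of $\widetilde{\mathrm{FPL}}$, and chain this bound with (7) for
$i \in \mathcal{E}^K$ we get:

$$\tilde{\ell}_{1:T} \le E[\tilde{s}^K_{1:T} + \sqrt{T}(\tilde{k}^K + 2)] = \ell^K_{1:T} + \sqrt{T}(\tilde{k}^K + 2)
\le s^i_{1:T} + \sqrt{T}\big[2\sqrt{2(k^i+1)} + \tfrac{1}{2} + 2\ln(k^i+1) + 2\big],$$

where we have used $K \le k^i + 1$. This bound is valid for all $i$ and has the desired
regret $\propto \sqrt{Tk^i}$. Similarly we can derive regret bounds $\propto \sqrt{s^i_{1:T}k^i}$ by exploiting
that the bounds in Theorems 7 and 8 are concave in $s^i_{1:T}$ and using Jensen's
inequality.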



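Theorem 9 (Hierarchical FPL bound for dynamic $\varepsilon_t$). The hierarchical
$\widetilde{\mathrm{FPL}}$ employs at time $t$ the prediction of expert $i_t := I_t^{\tilde{I}_t}$, where

$$I_t^K := \arg\min_{i:\lceil k^i\rceil = K}\Big\{s^i_{<t} + \frac{k^i - q^i}{\varepsilon_t^K}\Big\} \quad\text{and}\quad
\tilde{I}_t := \arg\min_{K\in\mathbb{N}}\Big\{s_1^{I_1^K} + \dots + s_{t-1}^{I_{t-1}^K} + \frac{\frac{1}{2} + 2\ln K - \tilde{q}^K}{\tilde{\varepsilon}_t}\Big\}.$$

Under assumptions (6) and independent $P[\tilde{q}^K] = e^{-\tilde{q}^K}\ \forall K \in \mathbb{N}$, the expected loss
$\tilde{\ell}_{1:T} = E[s_1^{i_1} + \dots + s_T^{i_T}]$ of $\widetilde{\mathrm{FPL}}$ is bounded as follows:

a) For $\varepsilon_t^K = \sqrt{K/2t}$ and $\tilde{\varepsilon}_t = 1/\sqrt{t}$ we have

$$\tilde{\ell}_{1:T} \le s^i_{1:T} + 2\sqrt{2Tk^i}\cdot\Big(1 + O\Big(\frac{\ln k^i}{\sqrt{k^i}}\Big)\Big) \quad \forall i.$$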

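b) For $\tilde{\varepsilon}_t$ as in (i) and $\varepsilon_t^K$ as in (ii) of Theorem 7 (respectively Theorem 8) we have

$$\tilde{\ell}_{1:T} \le s^i_{1:T} + 2\sqrt{2s^i_{1:T}k^i}\cdot\Big(1 + O\Big(\frac{\ln k^i}{\sqrt{k^i}}\Big)\Big) + O(k^i) \quad (\text{respectively } + O(k^i \ln s^i_{1:T})) \quad \forall i.$$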



The hierarchical $\widetilde{\mathrm{FPL}}$ differs from a direct FPL over all experts $\mathcal{E}$. One
potential way to prove a bound on direct FPL may be to show (if it holds) that
FPL performs better than $\widetilde{\mathrm{FPL}}$, i.e. $\ell_{1:T} \le \tilde{\ell}_{1:T}$. Another way may be to suitably
generalize Theorem 4 to expert dependent $\varepsilon$.

7 Miscellaneous



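Lower Bound on FPL. For finite $n$, a lower bound on FPL similar to the
upper bound in Theorem 2 can also be proven. For any $\mathcal{D}$ and $s_t \in \mathbb{R}^n$ such
that the required extrema exist, $q \in \mathbb{R}^n$, and $\varepsilon_t > 0$ decreasing, the loss of FPL
for uniform complexities can be lower bounded in terms of the best predictor in
hindsight plus/minus additive corrections:

$$\sum_{t=1}^{T} M\Big(s_{<t} - \frac{q}{\varepsilon_t}\Big)\circ s_t \;\ge\; \min_{d\in\mathcal{D}}\{d\circ s_{1:T}\} - \frac{1}{\varepsilon_T}\max_{d\in\mathcal{D}}\{d\circ q\} + \sum_{t=1}^{T}\Big(\frac{1}{\varepsilon_t} - \frac{1}{\varepsilon_{t-1}}\Big) M(s_{<t})\circ q \qquad (8)$$

For $\mathcal{D} = \mathcal{E}$ and any $\mathcal{S}$ and all $k^i$ equal and $P[q^i] = e^{-q^i}$ for $q \ge 0$ and decreasing
$\varepsilon_t > 0$, this reduces to

$$\ell_{1:T} \;\ge\; s^{\min}_{1:T} - \frac{\ln n}{\varepsilon_T} \qquad (9)$$

The upper and lower bounds on $\ell_{1:T}$ (Theorem 4 and Corollary 3 and (9))
together show that

$$\frac{\ell_{1:T}}{s^{\min}_{1:T}} \to 1 \quad\text{if}\quad \varepsilon_T \to 0 \ \text{ and }\ \varepsilon_T\cdot s^{\min}_{1:T} \to \infty \ \text{ and }\ k^i = K\ \forall i. \qquad (10)$$

For instance, $\varepsilon_t = \sqrt{K/2s^{\min}_{<t}}$. For $\varepsilon_t = \sqrt{K/2(\ell_{<t}+1)}$ we proved the bound in
Theorem 7(ii). Knowing that $\sqrt{K/2(\ell_{<t}+1)}$ converges to $\sqrt{K/2s^{\min}_{<t}}$ due to
(10), we can derive a bound similar to Theorem 7(ii) for $\varepsilon_t = \sqrt{K/2s^{\min}_{<t}}$. This
choice for $\varepsilon_t$ has the advantage that we do not have to compute $\ell_{<t}$ (see below), as
also achieved by Theorem 8(ii). We do not know whether (8) can be generalized
to expert dependent complexities $k^i$.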

Initial versus independent randomization. So far we assumed that the 
perturbations are sampled only once at time t = 0. As already indicated, under 
the expectation this is equivalent to generating a new perturbation qt at each 
time step t, i.e. Theorems 4-9 remain valid for this case. While the former 
choice was favorable for the analysis, the latter has two advantages. First, if 
the losses are generated by an adaptive adversary, then he may after some time 
figure out the initial random perturbation and use it to force FPL to have a 
large loss. On the other hand, for independent randomization, one can show 
that our bounds remain valid, even if the environment has access to FPL’s past 
predictions. Second, repeated sampling of the perturbations guarantees better 
bounds with high probability. 
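
The difference between the two randomization schemes is easiest to see in code. The following is a minimal sketch, not the authors' implementation: it draws a fresh exponential perturbation at every step and uses the learning rate $\varepsilon_t = \sqrt{K/2t}$ of Theorem 6(ii). The function name `fpl_independent`, the loss matrix layout and the requirement `K > 0` are assumptions made for this illustration.

```python
import math
import random

def fpl_independent(losses, k, K, rng):
    """Follow the Perturbed Leader with a fresh perturbation q_t at every step.

    losses[t][i] in [0, 1] is the loss of expert i at step t+1,
    k[i] is the complexity penalty of expert i, and K > 0 bounds all k[i].
    Returns (list of chosen experts, total loss suffered).
    """
    n = len(k)
    cum = [0.0] * n                       # cumulative losses s^i_{<t}
    chosen, total = [], 0.0
    for t, s_t in enumerate(losses, start=1):
        eps = math.sqrt(K / (2.0 * t))    # learning rate as in Theorem 6(ii)
        q = [rng.expovariate(1.0) for _ in range(n)]  # independent q_t ~ Exp(1)
        i_t = min(range(n), key=lambda i: cum[i] + (k[i] - q[i]) / eps)
        chosen.append(i_t)
        total += s_t[i_t]
        for i in range(n):                # reveal the losses of all experts
            cum[i] += s_t[i]
    return chosen, total

# Hypothetical usage: four experts with uniform complexity K = ln n.
rng = random.Random(0)
T, n = 200, 4
K = math.log(n)
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
chosen, total = fpl_independent(losses, [K] * n, K, rng)
```

Sampling `q` once before the loop instead of inside it would give the initial randomization used in the analysis; the expected loss is the same in both cases.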

Bounds with high probability. We have derived several bounds for the
expected loss $\ell_{1:T}$ of FPL. The actual loss at time $t$ is $u_t = M(s_{<t} + \frac{k-q}{\varepsilon_t})\circ s_t$. A
simple Markov inequality shows that the total actual loss $u_{1:T}$ exceeds the total
expected loss $\ell_{1:T} = E[u_{1:T}]$ by a factor of $c > 1$ with probability at most $1/c$:

$$P[u_{1:T} \ge c\cdot\ell_{1:T}] \le 1/c.$$
Randomizing independently for each $t$ as described in the previous paragraph,
the actual loss is $u_t = M(s_{<t} + \frac{k-q_t}{\varepsilon_t})\circ s_t$ with the same expected loss $\ell_{1:T} = E[u_{1:T}]$
as before. The advantage of independent randomization is that we can get a
much better high-probability bound. We can exploit a Chernoff-Hoeffding bound
[McD89, Cor. 5.2b], valid for arbitrary independent random variables $0 \le u_t \le 1$
for $t = 1,\dots,T$:

$$P\Big[\,|u_{1:T} - E[u_{1:T}]| \ge \delta E[u_{1:T}]\,\Big] \le 2\exp\big(-\tfrac{1}{3}\delta^2 E[u_{1:T}]\big), \qquad 0 \le \delta \le 1.$$

For $\delta = \sqrt{3c/\ell_{1:T}}$ we get

$$P\Big[\,|u_{1:T} - \ell_{1:T}| \ge \sqrt{3c\,\ell_{1:T}}\,\Big] \le 2e^{-c} \quad\text{as soon as } \ell_{1:T} \ge 3c. \qquad (11)$$

Using (11), the bounds for $\ell_{1:T}$ of Theorems 5-8 can be rewritten to yield similar
bounds with high probability $(1 - 2e^{-c})$ for $u_{1:T}$ with small extra regret $\propto \sqrt{c\cdot L}$
or $\propto \sqrt{c\cdot s^i_{1:T}}$. Furthermore, (11) shows that with high probability, $u_{1:T}/\ell_{1:T}$
converges rapidly to 1 for $\ell_{1:T} \to \infty$. Hence we may use the easier to compute
$\varepsilon_t = \sqrt{K/2u_{<t}}$ instead of $\varepsilon_t = \sqrt{K/2(\ell_{<t}+1)}$, with similar bounds on the regret.
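
To get a feeling for how conservative (11) is, one can compare it with an exact tail in a case where the tail is computable. The following sketch assumes, purely for illustration, that the per-step losses are independent Bernoulli variables with a common mean `p`, so that the total loss is binomial; the helper names are hypothetical.

```python
import math

def binomial_two_sided_tail(T, p, dev):
    """Exact P[|X - T*p| >= dev] for X ~ Binomial(T, p)."""
    mean = T * p
    return sum(
        math.comb(T, j) * p**j * (1 - p) ** (T - j)
        for j in range(T + 1)
        if abs(j - mean) >= dev
    )

def chernoff_deviation(ell, c):
    """Deviation sqrt(3*c*ell) and failure probability 2*exp(-c) as in (11)."""
    if ell < 3 * c:
        raise ValueError("(11) requires ell >= 3c")
    return math.sqrt(3 * c * ell), 2 * math.exp(-c)

T, p, c = 200, 0.3, 2.0
dev, bound = chernoff_deviation(T * p, c)      # expected total loss is T*p
exact = binomial_two_sided_tail(T, p, dev)     # exact probability of the event
```

In this toy setting `exact` is far below `bound`, which is consistent with (11) being a worst-case statement over all independent loss sequences in $[0,1]$.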

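Computational Aspects. It is easy to generate the randomized decision of
FPL. Indeed, only a single initial exponentially distributed vector $q \in \mathbb{R}^n$ is
needed. Only for adaptive $\varepsilon_t \propto 1/\sqrt{\ell_{<t}}$ (see Theorem 7) we need to compute
expectations explicitly. Given $\varepsilon_t$, in the step from $t$ to $t+1$ we need to compute $\ell_t$ in order
to update $\varepsilon_t$. Note that $\ell_t = w_t\circ s_t$, where $w_t^i = P[I_t = i]$ and
$I_t := \arg\min_{i\in\mathcal{E}}\{s^i_{<t} + \frac{k^i - q^i}{\varepsilon_t}\}$ is the actual (randomized) prediction of FPL.
With $s := s_{<t} + k/\varepsilon_t$, $s^{\min} := \min_j s^j$ and $\mathcal{N} := \{1,\dots,n\}$, $P[I_t = i]$
has the following representation:

$$P[I_t = i] = \int_{-\infty}^{s^{\min}} \varepsilon_t e^{-\varepsilon_t(s^i - m)} \prod_{j\ne i}\big(1 - e^{-\varepsilon_t(s^j - m)}\big)\,dm
= \sum_{M:\{i\}\subseteq M\subseteq\mathcal{N}} \frac{(-1)^{|M|-1}}{|M|}\, e^{-\varepsilon_t\sum_{j\in M}(s^j - s^{\min})}$$

In the last equality we expanded the product and performed the resulting
exponential integrals. For finite $n$, the one-dimensional integral should be numerically
feasible. Once the product $\prod_{j=1}^{n}(1 - e^{-\varepsilon_t(s^j - m)})$ has been computed in time $O(n)$,
the argument of the integral can be computed for each $i$ in time $O(1)$, hence the
overall time to compute $\ell_t$ is $O(c\cdot n)$, where $c$ is the time to numerically
compute one integral. For infinite$^3$ $n$, the last sum may be approximated by the
dominant contributions. The expectation may also be approximated by (Monte
Carlo) sampling $I_t$ several times. Recall that approximating $\ell_{<t}$ can be avoided
by using $s^{\min}_{<t}$ (Theorem 8) or $u_{<t}$ (bounds with high probability) instead.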
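
The selection probabilities $w_t^i = P[I_t = i]$ can be computed in two ways for small $n$. The sketch below is an illustration under assumptions, not the authors' procedure: it takes the perturbations to be i.i.d. standard exponential, substitutes $x = e^{-\varepsilon_t(s^{\min} - m)}$ so that the integral becomes $\int_0^1 a_i \prod_{j\ne i}(1 - a_j x)\,dx$ with $a_j = e^{-\varepsilon_t(s^j - s^{\min})}$, and evaluates it with Simpson's rule. For comparison it also evaluates the inclusion-exclusion sum, which is exponential in $n$. Function names and the example vector `s` are hypothetical, and the simple loop costs $O(n^2)$ per quadrature node rather than the $O(n)$ discussed in the text.

```python
import math
from itertools import combinations

def fpl_weights_integral(s, eps, steps=2000):
    """w[i] = P[I_t = i] via Simpson's rule on [0, 1] (steps must be even)."""
    s_min = min(s)
    a = [math.exp(-eps * (sj - s_min)) for sj in s]   # a[j] lies in (0, 1]
    n = len(s)

    def integrand(i, x):
        value = a[i]
        for j in range(n):
            if j != i:
                value *= 1.0 - a[j] * x
        return value

    h = 1.0 / steps
    weights = []
    for i in range(n):
        acc = integrand(i, 0.0) + integrand(i, 1.0)
        for m in range(1, steps):
            acc += (4.0 if m % 2 else 2.0) * integrand(i, m * h)
        weights.append(acc * h / 3.0)
    return weights

def fpl_weights_sum(s, eps):
    """Same probabilities via the sum over all subsets M containing i."""
    s_min = min(s)
    n = len(s)
    weights = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):
            for rest in combinations(others, r):
                members = (i,) + rest
                exponent = -eps * sum(s[j] - s_min for j in members)
                total += (-1) ** (len(members) - 1) / len(members) * math.exp(exponent)
        weights.append(total)
    return weights

s = [0.0, 1.0, 2.5]          # hypothetical values of s^i_{<t} + k^i / eps_t
w_int = fpl_weights_integral(s, eps=0.7)
w_sum = fpl_weights_sum(s, eps=0.7)
```

The expected instantaneous loss is then the inner product of these weights with the loss vector $s_t$.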

Deterministic prediction and absolute loss. Another use of $w_t$ from the
last paragraph is the following: If the decision space is $\mathcal{D} = \Delta$, then FPL may
make a deterministic decision $d = w_t \in \Delta$ at time $t$ with bounds now holding for
sure, instead of selecting $e_i$ with probability $w_t^i$. For example for the absolute
loss $s_t^i = |x_t - y_t^i|$ with observation $x_t \in [0,1]$ and predictions $y_t^i \in [0,1]$, a master
algorithm predicting deterministically $w_t\circ y_t \in [0,1]$ suffers absolute loss
$|x_t - w_t\circ y_t| \le \sum_i w_t^i|x_t - y_t^i| = \ell_t$, and hence has the same (or better) performance
guarantees as FPL. In general, masters can be chosen deterministic if prediction
space $\mathcal{Y}$ and loss-function Loss(x,y) are convex.

$^3$ For practical realizations in case of infinite $n$, one must use finite subclasses of
increasing size, compare [LW94].



8 Discussion and Open Problems 

How does FPL compare with other expert advice algorithms? We briefly discuss 
four issues. 

Static bounds. Here the coefficient of the regret term $\sqrt{KL}$, referred to as the
leading constant in the sequel, is 2 for FPL (Theorem 5). It is thus a factor of
$\sqrt{2}$ worse than the Hedge bound for arbitrary loss [FS97], which is sharp in some
sense [Vov95]. For special loss functions, the bounds can sometimes be improved,
e.g. to a leading constant of 1 in the static WM case with 0/1 loss [CB97].

Dynamic bounds. Not knowing the right learning rate in advance usually costs
a factor of $\sqrt{2}$. This is true for Hannan's algorithm [KV03] as well as in all our
cases. Also for binary prediction with uniform complexities and 0/1 loss, this
result has been established recently: [YEYS04] show a dynamic regret bound
with leading constant $\sqrt{2}(1+\varepsilon)$. Remarkably, the best dynamic bound for a
WM variant proven in [ACBG02] has a leading constant $2\sqrt{2}$, which matches
ours. Considering the difference in the static case, we therefore conjecture that
a bound with leading constant of 2 holds for a dynamic Hedge algorithm.

General weights. While there are several dynamic bounds for uniform weights,
the only result for non-uniform weights we know of is [Gen03, Cor. 16], which gives
a dynamic bound for a p-norm algorithm for the absolute loss if the weights are
rapidly decaying. Our hierarchical FPL bound in Theorem 9 (b) generalizes it
to arbitrary weights and losses and strengthens it, since both asymptotic order
and leading constant are smaller. Also the FPL analysis gets more complicated
for general weights. We conjecture that the bounds $\propto \sqrt{Tk^i}$ and $\propto \sqrt{s^i_{1:T}k^i}$ also
hold without the hierarchy trick, probably by using expert dependent learning
rate $\varepsilon_t^i$.

Comparison to Bayesian sequence prediction. We can also compare the
worst-case bounds for FPL obtained in this work to similar bounds for Bayesian
sequence prediction. Let $\{\nu_i\}$ be a class of probability distributions over sequences
and assume that the true sequence is sampled from $\mu \in \{\nu_i\}$ with complexity
$k^{\mu}$ ($\sum_i 2^{-k^{\nu_i}} \le 1$). Then it is known that the Bayes-optimal predictor based
on the $2^{-k^{\nu_i}}$-weighted mixture of $\nu_i$'s has an expected total loss of at most
$L^{\mu} + 2\sqrt{L^{\mu}k^{\mu}} + 2k^{\mu}$, where $L^{\mu}$ is the expected total loss of the Bayes-optimal
predictor based on $\mu$ [Hut03a, Thm.2]. Using FPL, we obtained the same bound
except for the leading order constant, but for any sequence independently of the
assumption that it is generated by $\mu$. This is another indication that a PEA
bound with leading constant 2 could hold. See [Hut03b, Sec. 6.3] for a more
detailed comparison of Bayes bounds with PEA bounds.







References 

[ACBG02] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident
on-line learning algorithms. Journal of Computer and System Sciences,
64(1):48-75, 2002.

[AG00] P. Auer and C. Gentile. Adaptive and self-confident on-line learning algo-
rithms. In Proceedings of the 13th Conference on Computational Learning
Theory, pages 107-117. Morgan Kaufmann, San Francisco, 2000.

[CB97] N. Cesa-Bianchi et al. How to use expert advice. Journal of the ACM,
44(3):427-485, 1997.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line 
learning and an application to boosting. Journal of Computer and System 
Sciences, 55(1):119-139, 1997. 

[Gen03] C. Gentile. The robustness of the p-norm algorithm. Machine Learning, 
53(3):265-299, 2003. 

[Han57] J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher,
A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games
3, pages 97-139. Princeton University Press, 1957.

[Hut03a] M. Hutter. Convergence and loss bounds for Bayesian sequence prediction. 
IEEE Transactions on Information Theory, 49(8):2061-2067, 2003. 

[Hut03b] M. Hutter. Optimality of universal Bayesian prediction for general loss and 
alphabet. Journal of Machine Learning Research, 4:971-1000, 2003. 

[KV03] A. Kalai and S. Vempala. Efficient algorithms for online decision. In 
Proceedings of the 16th Annual Conference on Learning Theory (COLT- 
2003), Lecture Notes in Artificial Intelligence, pages 506-521, Berlin, 2003. 
Springer. 

[LW89] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In 
30th Annual Symposium on Foundations of Computer Science, pages 256- 
261, Research Triangle Park, North Carolina, 1989. IEEE. 

[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. 
Information and Computation, 108(2):212-261, 1994. 

[McD89] C. McDiarmid. On the method of bounded differences. Surveys in Combi- 
natorics, 141, London Mathematical Society Lecture Notes Series: 148-188, 
1989. 

[Vov90] V. G. Vovk. Aggregating strategies. In Proceedings of the Third Annual 
Workshop on Computational Learning Theory, pages 371-383, Rochester, 
New York, 1990. ACM Press. 

[Vov95] V. G. Vovk. A game of prediction with expert advice. In Proceedings of the 
8th Annual Conference on Computational Learning Theory, pages 51-60. 
ACM Press, New York, NY, 1995. 

[YEYS04] R. Yaroshinsky, R. El-Yaniv, and S. Seiden. How to better use expert 
advice. Machine Learning, 2004. 




On the Convergence Speed of MDL Predictions 
for Bernoulli Sequences* 



Jan Poland and Marcus Hutter 

IDSIA, Galleria 2 CH-6928 Manno (Lugano), Switzerland 
{jan,marcus}@idsia.ch



Abstract. We consider the Minimum Description Length principle for 
online sequence prediction. If the underlying model class is discrete, then
the total expected square loss is a particularly interesting performance 
measure: (a) this quantity is bounded, implying convergence with prob- 
ability one, and (b) it additionally specifies a rate of convergence. Gen- 
erally, for MDL only exponential loss bounds hold, as opposed to the 
linear bounds for a Bayes mixture. We show that this is even the case if 
the model class contains only Bernoulli distributions. We derive a new 
upper bound on the prediction error for countable Bernoulli classes. This 
implies a small bound (comparable to the one for Bayes mixtures) for 
certain important model classes. The results apply to many Machine 
Learning tasks including classification and hypothesis testing. We pro- 
vide arguments that our theorems generalize to countable classes of i.i.d. 
models. 



1 Introduction 

“Bayes mixture”, “Solomonoff induction”, “marginalization”, all these terms
refer to a central induction principle: Obtain a predictive distribution by
integrating the product of prior and evidence over the model class. In many cases
however, the Bayes mixture cannot be computed, and even a sophisticated ap- 
proximation is expensive. The MDL or MAP (maximum a posteriori) estimator 
is both a common approximation for the Bayes mixture and interesting for its 
own sake: Use the model with the largest product of prior and evidence. (In 
practice, the MDL estimator is usually being approximated too, in particular 
when only a local maximum is determined.) 

How good are the predictions by Bayes mixtures and MDL? This question 
has attracted much attention. In the context of prediction, arguably the most 
important quality measure is the total or cumulative expected loss of a predictor. 
A very common choice of loss function is the square loss. Throughout this paper, 
we will study this quantity in an online setup. 

Assume that the outcome space is finite, and the model class is continuously
parameterized. Then for Bayes mixture prediction, the cumulative expected
square loss is usually small but unbounded, growing with $\log n$, where
$n$ is the sample size [1]. This corresponds to an instantaneous loss bound of $\frac{1}{n}$.
For the MDL predictor, the losses behave similarly [2,3] under appropriate
conditions, in particular with a specific prior. (Note that in order to do MDL for
continuous model classes, one needs to discretize the parameter space, e.g. [4].)

On the other hand, if the model class is discrete, then Solomonoff's theorem
[5,6] bounds the cumulative expected square loss for the Bayes mixture
predictions finitely, namely by $\ln w_{\mu}^{-1}$, where $w_{\mu}$ is the prior weight of the “true” model
$\mu$. The only necessary assumption is that the true distribution $\mu$ is contained in
the model class. For the corresponding MDL predictions, we have shown [7] that
a bound of $w_{\mu}^{-1}$ holds. This is exponentially larger than the Solomonoff bound,
and it is sharp in general. A finite bound on the total expected square loss is
particularly interesting:

1. It implies convergence of the predictive to the true probabilities with prob- 
ability one. In contrast, an instantaneous loss bound which tends to zero 
implies only convergence in probability. 

2. Additionally, it gives a convergence speed, in the sense that errors of a certain 
magnitude cannot occur too often. 

So for both, Bayes mixtures and MDL, convergence with probability one holds, 
while the convergence rate is exponentially worse for MDL compared to the 
Bayes mixture. 

It is therefore natural to ask if there are model classes where the cumulative 
loss of MDL is comparable to that of Bayes mixture predictions. Here we will 
concentrate on the simplest possible stochastic case, namely discrete Bernoulli 
classes (compare also [8]). It might be surprising to discover that in general 
the cumulative loss is still exponential. On the other hand, we will give mild 
conditions on the prior guaranteeing a small bound. We will provide arguments 
that these results generalize to arbitrary i.i.d. classes. Moreover, we will see that
the instantaneous (as opposed to the cumulative) bounds are always small ($\approx \frac{1}{n}$).
This corresponds to the well-known fact that the instantaneous square loss of
the Maximum Likelihood estimator decays as $\frac{1}{n}$ in the Bernoulli case.

A particular motivation to consider discrete model classes arises in Algo- 
rithmic Information Theory. From a computational point of view, the largest 
relevant model class is the countable class of all computable models (isomorphic 
to programs) on some fixed universal Turing machine. We may study the cor- 
responding Bernoulli case and consider the countable set of computable reals in 
[0, 1]. We call this the universal setup. The description length $K(\vartheta)$ of a
parameter $\vartheta \in [0,1]$ is then given by the length of the shortest program that outputs
$\vartheta$, and a prior weight may be defined by $2^{-K(\vartheta)}$.

Many Machine Learning tasks are or can be reduced to sequence prediction
tasks. An important example is classification. The task of classifying a new
instance $z_n$ after having seen (instance,class) pairs $(z_1,c_1),\dots,(z_{n-1},c_{n-1})$ can
be phrased as to predict the continuation of the sequence $z_1c_1\dots z_{n-1}c_{n-1}z_n$.
Typically the (instance,class) pairs are i.i.d.

Our main tool for obtaining results is the Kullback-Leibler divergence. Lemmata
for this quantity are stated in Section 2. Section 3 shows that the exponential
error bound obtained in [7] is sharp in general. In Section 4, we give an
upper bound on the instantaneous and the cumulative losses. The latter bound
is small e.g. under certain conditions on the distribution of the weights; this is
the subject of Section 5. Section 6 treats the universal setup. Finally, in Section
7 we discuss the results and give conclusions.

2 Kullback-Leibler Divergence 

Let $\mathbb{B} = \{0,1\}$ and consider finite strings $x \in \mathbb{B}^*$ as well as infinite sequences
$x_{<\infty} \in \mathbb{B}^{\infty}$, with the first $n$ bits denoted by $x_{1:n}$. If we know that $x$ is generated
by an i.i.d. random variable, then $P(x_i = 1) = \vartheta_0$ for all $1 \le i \le \ell(x)$ where $\ell(x)$
is the length of $x$. Then $x$ is called a Bernoulli sequence, and $\vartheta_0 \in \Theta \subset [0,1]$ the
true parameter. In the following we will consider only countable $\Theta$, e.g. the set
of all computable numbers in [0, 1].

Associated with each $\vartheta \in \Theta$, there is a complexity or description length $Kw(\vartheta)$
and a weight or (semi)probability $w_{\vartheta} = 2^{-Kw(\vartheta)}$. The complexity will often but
need not be a natural number. Typically, one assumes that the weights sum up
to at most one, $\sum_{\vartheta\in\Theta} w_{\vartheta} \le 1$. Then, by the Kraft inequality, for all $\vartheta \in \Theta$ there
exists a prefix-code of length $Kw(\vartheta)$. Because of this correspondence, it is only a
matter of convenience if results are developed in terms of description lengths or
probabilities. We will choose the former way. We won't even need the condition
$\sum_{\vartheta} w_{\vartheta} \le 1$ for most of the following results. This only means that $Kw$ cannot
be interpreted as a prefix code length, but does not cause other problems.

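Given a set of distributions $\Theta \subset [0,1]$, complexities $(Kw(\vartheta))_{\vartheta\in\Theta}$, a true
distribution $\vartheta_0 \in \Theta$, and some observed string $x \in \mathbb{B}^*$, we define an MDL
estimator$^1$:

$$\vartheta^x = \arg\max_{\vartheta\in\Theta}\{w_{\vartheta} P(x|\vartheta_0 = \vartheta)\}.$$

Here, $P(x|\vartheta_0 = \vartheta)$ is the probability of observing $x$ if $\vartheta$ is the true parameter.
Clearly, $P(x|\vartheta_0 = \vartheta) = \vartheta^{\mathbb{I}(x)}(1-\vartheta)^{\ell(x)-\mathbb{I}(x)}$, where $\mathbb{I}(x)$ is the number of ones
in $x$. Hence $P(x|\vartheta_0 = \vartheta)$ depends only on $\ell(x)$ and $\mathbb{I}(x)$. We therefore see

$$\vartheta^x = \vartheta^{(\alpha,n)} = \arg\max_{\vartheta\in\Theta}\{w_{\vartheta}(\vartheta^{\alpha}(1-\vartheta)^{1-\alpha})^n\} \qquad (1)$$
$$\phantom{\vartheta^x = \vartheta^{(\alpha,n)}} = \arg\min_{\vartheta\in\Theta}\{n\cdot D(\alpha\|\vartheta) + Kw(\vartheta)\cdot\ln 2\},$$

where $n = \ell(x)$ and $\alpha := \frac{\mathbb{I}(x)}{\ell(x)}$ is the observed fraction of ones and
$$D(\alpha\|\vartheta) = \alpha\ln\frac{\alpha}{\vartheta} + (1-\alpha)\ln\frac{1-\alpha}{1-\vartheta}$$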

$^1$ Precisely, we define a MAP (maximum a posteriori) estimator. For two reasons,
our definition might not be considered as MDL in the strict sense. First, MDL is
often associated with a specific prior, while we admit arbitrary priors. Second and
more importantly, when coding some data $x$, one can exploit the fact that once the
parameter $\vartheta^x$ is specified, only data which leads to this $\vartheta^x$ needs to be considered.
This allows for a description shorter than $Kw(\vartheta^x)$. Nevertheless, the construction
principle is commonly termed MDL, compare e.g. the “ideal MDL” in [9].
is the Kullback-Leibler divergence. Let $\vartheta, \tilde{\vartheta} \in \Theta$ be two parameters, then it
follows from (1) that in the process of choosing the MDL estimator, $\vartheta$ is being
preferred to $\tilde{\vartheta}$ iff

$$n(D(\alpha\|\tilde{\vartheta}) - D(\alpha\|\vartheta)) \ge \ln 2\cdot(Kw(\vartheta) - Kw(\tilde{\vartheta})). \qquad (2)$$

In this case, we say that $\vartheta$ beats $\tilde{\vartheta}$. It is immediate that for increasing $n$ the
influence of the complexities on the selection of the maximizing element decreases.
We are now interested in the total expected square prediction error (or cumulative
square loss) of the MDL estimator $\sum_{n=1}^{\infty} E(\vartheta^{x_{1:n}} - \vartheta_0)^2$. In terms of [7], this
is the static MDL prediction loss, which means that a predictor/estimator $\vartheta^x$
is chosen according to the current observation $x$. The dynamic method on the
other hand would consider both possible continuations $x0$ and $x1$ and predict
according to $\vartheta^{x0}$ and $\vartheta^{x1}$. In the following, we concentrate on static predictions.
They are also preferred in practice, since computing only one model is more
efficient.
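
The two equivalent forms of the estimator in (1) are easy to compare on a toy class. The following sketch is an illustration only: the three-element class `Kw`, its complexities and the function names are hypothetical, and parameters are assumed to lie strictly inside $(0,1)$.

```python
import math

def kl(alpha, theta):
    """Kullback-Leibler divergence D(alpha || theta), with 0 * ln(0) = 0."""
    total = 0.0
    if alpha > 0:
        total += alpha * math.log(alpha / theta)
    if alpha < 1:
        total += (1 - alpha) * math.log((1 - alpha) / (1 - theta))
    return total

def mdl_estimate(ones, n, complexities):
    """argmin over theta of n * D(alpha || theta) + Kw(theta) * ln 2, as in (1)."""
    alpha = ones / n
    return min(
        complexities,
        key=lambda th: n * kl(alpha, th) + complexities[th] * math.log(2),
    )

def map_estimate(ones, n, complexities):
    """argmax over theta of w_theta * P(x | theta), evaluated in log space."""
    return max(
        complexities,
        key=lambda th: -complexities[th] * math.log(2)
        + ones * math.log(th)
        + (n - ones) * math.log(1 - th),
    )

# Hypothetical class: a cheap parameter 1/2 and two more complex alternatives.
Kw = {0.5: 1.0, 0.25: 3.0, 0.75: 3.0}
small_sample = mdl_estimate(7, 10, Kw)     # 7 ones in 10 bits
large_sample = mdl_estimate(70, 100, Kw)   # same fraction, ten times the data
```

With few observations the low complexity of $\frac{1}{2}$ dominates, while with the same fraction of ones in a longer string the better fitting parameter $\frac{3}{4}$ beats it, in line with the remark that the influence of the complexities decreases with $n$.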

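Let $A_n = \{\frac{k}{n} : 0 \le k \le n\}$. Given the true parameter $\vartheta_0$ and some $n \in \mathbb{N}$,
the expectation of a function $f^{(n)} : \{0,\dots,n\} \to \mathbb{R}$ is given by

$$E f^{(n)} = \sum_{\alpha\in A_n} p(\alpha|n) f(\alpha n), \quad\text{where}\quad p(\alpha|n) = \binom{n}{\alpha n}\big(\vartheta_0^{\alpha}(1-\vartheta_0)^{1-\alpha}\big)^n. \qquad (3)$$

(Note that the probability $p(\alpha|n)$ depends on $\vartheta_0$, which we do not make explicit
in our notation.) Therefore,

$$\sum_{n=1}^{\infty} E(\vartheta^{x_{1:n}} - \vartheta_0)^2 = \sum_{n=1}^{\infty}\sum_{\alpha\in A_n} p(\alpha|n)(\vartheta^{(\alpha,n)} - \vartheta_0)^2. \qquad (4)$$

Denote the relation $f = O(g)$ by $f \stackrel{\times}{\le} g$. Analogously define “$\stackrel{\times}{\ge}$” and “$\stackrel{\times}{=}$”. From
[7, Corollary 12], we immediately obtain the following result.

Theorem 1. The cumulative loss bound $\sum_n E(\vartheta^{x_{1:n}} - \vartheta_0)^2 \stackrel{\times}{\le} 2^{Kw(\vartheta_0)}$ holds.

This is the “slow” convergence result mentioned in the introduction. In
contrast, for a Bayes mixture, the total expected error is bounded by $Kw(\vartheta_0)$ rather
than $2^{Kw(\vartheta_0)}$ (see [5] or [6, Th.1]). An upper bound on $\sum_n E(\vartheta^{x_{1:n}} - \vartheta_0)^2$ is
termed as convergence in mean sum and implies convergence $\vartheta^{x_{1:n}} \to \vartheta_0$ with
probability 1 (since otherwise the sum would be infinite).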

We now establish relations between the Kullback-Leibler divergence and the 
quadratic distance. We call bounds of this type entropy inequalities. 

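Lemma 2. Let $\vartheta, \tilde{\vartheta} \in (0,1)$ and $\vartheta^* = \arg\min\{|\vartheta - \frac{1}{2}|, |\tilde{\vartheta} - \frac{1}{2}|\}$, i.e. $\vartheta^*$ is the
element from $\{\vartheta, \tilde{\vartheta}\}$ which is closer to $\frac{1}{2}$. Then

$$2\cdot(\vartheta - \tilde{\vartheta})^2 \overset{(i)}{\le} D(\vartheta\|\tilde{\vartheta}) \overset{(ii)}{\le} \tfrac{8}{3}(\vartheta - \tilde{\vartheta})^2 \quad\text{and}$$

$$\frac{(\vartheta - \tilde{\vartheta})^2}{2\vartheta^*(1-\vartheta^*)} \overset{(iii)}{\le} D(\vartheta\|\tilde{\vartheta}) \overset{(iv)}{\le} \frac{3(\vartheta - \tilde{\vartheta})^2}{2\vartheta^*(1-\vartheta^*)}.$$

Thereby, (ii) requires $\vartheta, \tilde{\vartheta} \in [\frac{1}{4}, \frac{3}{4}]$, (iii) requires $\vartheta, \tilde{\vartheta} \le \frac{1}{2}$, and (iv) requires
$\vartheta \le \frac{1}{4}$ and $\tilde{\vartheta} \in [\frac{\vartheta}{3}, 3\vartheta]$. Statements (iii) and (iv) have symmetric counterparts
for $\vartheta \ge \frac{1}{2}$.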

Proof. The lower bound (i) is standard, see e.g. [10, p. 329]. In order to verify
the upper bound (ii), let $f(\eta) = D(\vartheta\|\eta) - \frac{8}{3}(\eta - \vartheta)^2$. Then (ii) follows from
$f(\eta) \le 0$ for $\eta \in [\frac{1}{4}, \frac{3}{4}]$. We have that $f(\vartheta) = 0$ and $f'(\eta) = \frac{\eta - \vartheta}{\eta(1-\eta)} - \frac{16}{3}(\eta - \vartheta)$.
This difference is nonnegative if and only if $\eta - \vartheta \le 0$ since $\eta(1-\eta) \ge \frac{3}{16}$. This
implies $f(\eta) \le 0$. Statements (iii) and (iv) giving bounds if $\vartheta$ is close to the
boundary are proven similarly. □



Lemma 2 (ii) is sufficient to prove the lower bound on the error in Proposition
5. The bounds (iii) and (iv) are only needed in the technical proof of the upper
bound in Theorem 8, which will be omitted. It requires also similar upper and
lower bounds for the absolute distance, and if the second argument of $D(\cdot\|\cdot)$
tends to the boundary. The lemma remains valid for the extreme cases $\vartheta, \tilde{\vartheta} \in
\{0,1\}$ if the fraction $\frac{0}{0}$ is properly defined. It is likely to generalize to arbitrary
alphabet; for (i) this is shown in [6].

It is a well-known fact that the binomial distribution may be approximated 
by a Gaussian. Our next goal is to establish upper and lower bounds for the 
binomial distribution. Again we leave out the extreme cases. 



Lemma 3. Let $\vartheta_0 \in (0,1)$ be the true parameter, $n \ge 2$ and $1 \le k \le n-1$, and
$\alpha = \frac{k}{n}$. Then the following assertions hold.

$$(i)\quad p(\alpha|n) \le \frac{1}{\sqrt{2\pi\alpha(1-\alpha)n}}\exp\big(-nD(\alpha\|\vartheta_0)\big),$$

$$(ii)\quad p(\alpha|n) \ge \frac{1}{\sqrt{8\alpha(1-\alpha)n}}\exp\big(-nD(\alpha\|\vartheta_0)\big).$$

The lemma is verified using Stirling's formula. The upper bound is sharp for
$n \to \infty$ and fixed $\alpha$. Lemma 3 can be easily combined with Lemma 2, yielding
Gaussian estimates for the Binomial distribution. The following lemma is proved
by simply estimating the sums by appropriate integrals.



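Lemma 4. Let $z \in \mathbb{R}^+$, then

$$(i)\quad \frac{\sqrt{\pi}}{2z^3} - \frac{1}{z\sqrt{2e}} \;\le\; \sum_{n=1}^{\infty}\sqrt{n}\cdot\exp(-z^2 n) \;\le\; \frac{\sqrt{\pi}}{2z^3} + \frac{1}{z\sqrt{2e}} \quad\text{and}$$

$$(ii)\quad \sum_{n=1}^{\infty} n^{-\frac{1}{2}}\exp(-z^2 n) \;\le\; \sqrt{\pi}/z.$$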
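
Both the Stirling-type bounds for the binomial probabilities and the integral estimates for the two series can be checked numerically. The sketch below is an illustration under assumptions: it reads Lemma 3 as sandwiching $p(\alpha|n)$ between $\exp(-nD(\alpha\|\vartheta_0))/\sqrt{8\alpha(1-\alpha)n}$ and $\exp(-nD(\alpha\|\vartheta_0))/\sqrt{2\pi\alpha(1-\alpha)n}$, and Lemma 4 as bounding $\sum_n \sqrt{n}\,e^{-z^2 n}$ by $\frac{\sqrt{\pi}}{2z^3} \pm \frac{1}{z\sqrt{2e}}$ and $\sum_n n^{-1/2}e^{-z^2 n}$ by $\sqrt{\pi}/z$. The series are truncated after `terms` summands, which is an approximation chosen here.

```python
import math

def kl(alpha, theta):
    """Kullback-Leibler divergence D(alpha || theta) for alpha, theta in (0, 1)."""
    return (alpha * math.log(alpha / theta)
            + (1 - alpha) * math.log((1 - alpha) / (1 - theta)))

def lemma3_holds(n, theta0, rel_tol=1e-9):
    """Check the lower and upper bound on p(alpha | n) for all 1 <= k <= n-1."""
    for k in range(1, n):
        alpha = k / n
        pmf = math.comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
        core = math.exp(-n * kl(alpha, theta0))
        upper = core / math.sqrt(2 * math.pi * alpha * (1 - alpha) * n)
        lower = core / math.sqrt(8 * alpha * (1 - alpha) * n)
        if not (lower * (1 - rel_tol) <= pmf <= upper * (1 + rel_tol)):
            return False
    return True

def lemma4_holds(z, terms=5000):
    """Check both series estimates, truncating the sums after `terms` summands."""
    root_sum = sum(math.sqrt(n) * math.exp(-z * z * n) for n in range(1, terms + 1))
    inv_sum = sum(math.exp(-z * z * n) / math.sqrt(n) for n in range(1, terms + 1))
    main = math.sqrt(math.pi) / (2 * z**3)
    corr = 1 / (z * math.sqrt(2 * math.e))
    return main - corr <= root_sum <= main + corr and inv_sum <= math.sqrt(math.pi) / z

binomial_ok = lemma3_holds(30, 0.3)
series_ok = lemma4_holds(0.1) and lemma4_holds(1.0)
```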



3 Lower Bound 

We are now in the position to prove that even for Bernoulli classes the upper 
bound from Theorem 1 is sharp in general. 
Proposition 5. Let $\vartheta_0 = \frac{1}{2}$ be the true parameter generating sequences of
fair coin flips. Assume there are $2^N - 1$ other parameters $\vartheta_1, \dots, \vartheta_{2^N-1}$ with
$\vartheta_k = \frac{1}{2} + 2^{-k-1}$. Let all complexities be equal, i.e. $Kw(\vartheta_0) = Kw(\vartheta_1) = \dots =
Kw(\vartheta_{2^N-1}) = N$. Then

$$\sum_{n=1}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \;\ge\; \tfrac{1}{84}(2^N - 5) \;\stackrel{\times}{=}\; 2^{Kw(\vartheta_0)}.$$

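Proof. Recall that the maximizing element $\vartheta^x = \vartheta^{(\alpha,n)}$ for some observed
sequence $x$ only depends on the length $n$ and the observed fraction of ones $\alpha$. In
order to obtain an estimate for the total prediction error $\sum_n E(\vartheta_0 - \vartheta^x)^2$,
partition the interval [0, 1] into $2^N$ disjoint intervals $I_k$, such that $\bigcup_{k=0}^{2^N-1} I_k = [0,1]$.
Then consider the contributions for the observed fraction $\alpha$ falling in $I_k$
separately:

$$C(k) = \sum_{n=1}^{\infty}\ \sum_{\alpha\in A_n\cap I_k} p(\alpha|n)(\vartheta^{(\alpha,n)} - \vartheta_0)^2 \qquad (5)$$

(compare (3)). Clearly, $\sum_n E(\vartheta_0 - \vartheta^x)^2 = \sum_k C(k)$ holds. We define the
partitioning $(I_k)$ as $I_0 = [0, \frac{1}{2} + 2^{-2^N}) = [0, \vartheta_{2^N-1})$, $I_1 = [\frac{3}{4}, 1] = [\vartheta_1, 1]$, and
$I_k = [\vartheta_k, \vartheta_{k-1})$ for all $2 \le k \le 2^N - 1$.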

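Fix $k \in \{2, \dots, 2^N - 1\}$ and assume $\alpha \in I_k$. Then

$$\vartheta^{(\alpha,n)} = \arg\min_{\vartheta}\{nD(\alpha\|\vartheta) + Kw(\vartheta)\ln 2\} = \arg\min_{\vartheta}\{nD(\alpha\|\vartheta)\} \in \{\vartheta_k, \vartheta_{k-1}\}$$

according to (1). So clearly $(\vartheta^{(\alpha,n)} - \vartheta_0)^2 \ge (\vartheta_k - \vartheta_0)^2 = 2^{-2k-2}$ holds. Since
$p(\alpha|n)$ decreases for increasing $|\alpha - \vartheta_0|$, we have $p(\alpha|n) \ge p(\vartheta_{k-1}|n)$. The interval
$I_k$ has length $2^{-k-1}$, so there are at least $\lfloor n2^{-k-1}\rfloor \ge n2^{-k-1} - 1$ observed
fractions $\alpha$ falling in the interval. From (5), the total contribution of $\alpha \in I_k$ can
be estimated by

$$C(k) \ge \sum_{n=1}^{\infty} 2^{-2k-2}(n2^{-k-1} - 1)\,p(\vartheta_{k-1}|n).$$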

Note that the terms in the sum even become negative for small n, which does 
not cause any problems. We proceed with 

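$$p(\vartheta_{k-1}|n) \ge \frac{1}{\sqrt{8\cdot 2^{-2}n}}\exp\big[-nD(\tfrac{1}{2} + 2^{-k}\,\|\,\tfrac{1}{2})\big] \ge \frac{1}{\sqrt{2n}}\exp\big[-n\tfrac{8}{3}2^{-2k}\big]$$

according to Lemma 3 and Lemma 2 (ii). By Lemma 4 (i) and (ii), we have

$$\sum_{n=1}^{\infty}\sqrt{n}\exp\big[-n\tfrac{8}{3}2^{-2k}\big] \ge \frac{\sqrt{\pi}}{2}\Big(\frac{3}{8}\Big)^{3/2}2^{3k} - \sqrt{\frac{3}{16e}}\,2^{k}
\quad\text{and}\quad
-\sum_{n=1}^{\infty} n^{-\frac{1}{2}}\exp\big[-n\tfrac{8}{3}2^{-2k}\big] \ge -\sqrt{\frac{3\pi}{8}}\,2^{k}.$$

Considering only $k \ge 5$, we thus obtain

$$C(k) \ge \frac{2^{-2k-2}}{\sqrt{2}}\Big[2^{-k-1}\Big(\frac{\sqrt{\pi}}{2}\Big(\frac{3}{8}\Big)^{3/2}2^{3k} - \sqrt{\frac{3}{16e}}\,2^{k}\Big) - \sqrt{\frac{3\pi}{8}}\,2^{k}\Big]
= \frac{1}{4\sqrt{2}}\Big[\frac{\sqrt{\pi}}{4}\Big(\frac{3}{8}\Big)^{3/2} - \frac{1}{2}\sqrt{\frac{3}{16e}}\,2^{-2k} - \sqrt{\frac{3\pi}{8}}\,2^{-k}\Big] > \frac{1}{84}.$$

Ignoring the contributions for $k \le 4$, this implies the assertion.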



□ 
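
The restriction to $k \ge 5$ and the size of the resulting constant can be explored numerically. The following sketch is an illustration under assumptions rather than a transcription of the proof: it combines the estimate $C(k) \ge \sum_n 2^{-2k-2}(n2^{-k-1} - 1)\,p(\vartheta_{k-1}|n)$ with $p(\vartheta_{k-1}|n) \ge (2n)^{-1/2}\exp[-n\frac{8}{3}2^{-2k}]$ and the series estimates of Lemma 4 for $z = \sqrt{8/3}\,2^{-k}$, and evaluates the resulting closed-form lower bound. The function name and the comparison value $\frac{1}{84}$ are assumptions of this example.

```python
import math

def contribution_lower_bound(k):
    """Closed-form lower bound on C(k) obtained from Lemmas 2, 3 and 4."""
    z = math.sqrt(8.0 / 3.0) * 2.0 ** (-k)
    # Lemma 4 (i): lower bound on sum_n sqrt(n) * exp(-z^2 n)
    root_sum_lower = math.sqrt(math.pi) / (2 * z**3) - 1 / (z * math.sqrt(2 * math.e))
    # Lemma 4 (ii): upper bound on sum_n n^(-1/2) * exp(-z^2 n)
    inv_sum_upper = math.sqrt(math.pi) / z
    bracket = 2.0 ** (-k - 1) * root_sum_lower - inv_sum_upper
    return 2.0 ** (-2 * k - 2) / math.sqrt(2) * bracket

values = {k: contribution_lower_bound(k) for k in range(2, 12)}
```

Under these assumptions the bound is increasing in $k$, stays below $\frac{1}{84}$ for $k \le 4$ and exceeds it from $k = 5$ on, which matches the case distinction made in the proof.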



This result shows that if the parameters and their weights are chosen in an
appropriate way, then the total expected error is of order $w_0^{-1}$ instead of $\ln w_0^{-1}$.
Interestingly, this outcome seems to depend on the arrangement and the weights
of the false parameters rather than on the weight of the true one. One can check
with moderate effort that the proposition still remains valid if e.g. $w_0$ is twice
as large as the other weights. Actually, the proof of Proposition 5 shows even
a slightly more general result, namely the same bound holds when there are 
additional arbitrary parameters with larger complexities. This will be used for 
Example 14. Other and more general assertions can be proven similarly. 

4 Upper Bounds 

Although the cumulative error may be large, as seen in the previous section, the 
instantaneous error is always small. 

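Proposition 6. For $n \ge 3$, the expected instantaneous square loss is bounded:

$$E(\vartheta_0 - \vartheta^{x_{1:n}})^2 \le \frac{(\ln 2)Kw(\vartheta_0)}{2n} + \frac{\sqrt{2(\ln 2)Kw(\vartheta_0)\ln n}}{n} + \frac{6\ln n}{n}.$$

Proof. We give an elementary proof for the case $\vartheta_0 \in (\frac{1}{4}, \frac{3}{4})$ only. Like in the
proof of Proposition 5, we consider the contributions of different $\alpha$ separately.
By Hoeffding's inequality, $P(|\alpha - \vartheta_0| \ge \frac{c}{\sqrt{n}}) \le 2e^{-2c^2}$ for any $c > 0$. Letting
$c = \sqrt{\ln n}$, the contributions by these $\alpha$ are thus bounded by $\frac{2}{n^2} \le \frac{\ln n}{n}$.

On the other hand, for $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$, recall that $\vartheta_0$ beats any $\vartheta$ iff (2) holds.
According to $Kw(\vartheta) \ge 0$ (i.e. $w_{\vartheta} \le 1$), $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$, and Lemma 2 (i) and (ii), (2) is already
implied by $|\alpha - \vartheta| \ge \sqrt{\frac{\frac{1}{2}(\ln 2)Kw(\vartheta_0) + \frac{4}{3}c^2}{n}}$. Clearly, a contribution only occurs if $\vartheta$
beats $\vartheta_0$, therefore if the opposite inequality holds. Using $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$ again
and the triangle inequality, we obtain that

$$(\vartheta - \vartheta_0)^2 \le \frac{5c^2 + \frac{1}{2}(\ln 2)Kw(\vartheta_0) + \sqrt{2(\ln 2)Kw(\vartheta_0)c^2}}{n}$$

in this case. Since we have chosen $c = \sqrt{\ln n}$, this implies the assertion. □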
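
For a small finite class the expected instantaneous loss can be computed exactly from (3) and compared with the bound of Proposition 6. The sketch below is an illustration only: the four-element class, its complexities, the true parameter $\frac{1}{2}$ and the function names are hypothetical, and the bound is read as $\frac{(\ln 2)Kw(\vartheta_0)}{2n} + \frac{\sqrt{2(\ln 2)Kw(\vartheta_0)\ln n}}{n} + \frac{6\ln n}{n}$.

```python
import math

def kl(alpha, theta):
    """Kullback-Leibler divergence D(alpha || theta), with 0 * ln(0) = 0."""
    total = 0.0
    if alpha > 0:
        total += alpha * math.log(alpha / theta)
    if alpha < 1:
        total += (1 - alpha) * math.log((1 - alpha) / (1 - theta))
    return total

def mdl_estimate(ones, n, complexities):
    """argmin over theta of n * D(alpha || theta) + Kw(theta) * ln 2."""
    alpha = ones / n
    return min(
        complexities,
        key=lambda th: n * kl(alpha, th) + complexities[th] * math.log(2),
    )

def expected_instantaneous_loss(n, theta0, complexities):
    """Exact E(theta0 - theta^(alpha, n))^2, summing over the number of ones."""
    total = 0.0
    for ones in range(n + 1):
        prob = math.comb(n, ones) * theta0**ones * (1 - theta0) ** (n - ones)
        total += prob * (mdl_estimate(ones, n, complexities) - theta0) ** 2
    return total

def proposition6_bound(n, kw0):
    """Right-hand side of Proposition 6 for sample size n >= 3."""
    scaled = math.log(2) * kw0
    return scaled / (2 * n) + math.sqrt(2 * scaled * math.log(n)) / n + 6 * math.log(n) / n

# Hypothetical class with four parameters of equal complexity 2 (weights 1/4).
Kw = {0.5: 2.0, 0.375: 2.0, 0.625: 2.0, 0.75: 2.0}
report = {n: (expected_instantaneous_loss(n, 0.5, Kw), proposition6_bound(n, 2.0))
          for n in (3, 10, 50, 200, 1000)}
```

For this class the squared error can never exceed $(\frac{1}{4})^2$, so the bound only becomes informative for larger $n$; at $n = 1000$ the exact loss is already several orders of magnitude below it.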
One can improve the bound in Proposition 6 to $E(\vartheta_0 - \vartheta^{x_{1:n}})^2 \stackrel{\times}{\le} \frac{Kw(\vartheta_0)}{n}$ by
a refined argument, compare [4]. But the high-level assertion is the same: Even
if the cumulative upper bound may tend to infinity, the instantaneous error
converges rapidly to 0. Moreover, the convergence speed depends on $Kw(\vartheta_0)$ as
opposed to $2^{Kw(\vartheta_0)}$. Thus $\vartheta^{x_{1:n}}$ tends to $\vartheta_0$ rapidly in probability (recall that the
assertion is not strong enough to conclude almost sure convergence). The proof
does not exploit $\sum_{\vartheta} w_{\vartheta} \le 1$, but only $w_{\vartheta} \le 1$, hence the assertion even holds
for a maximum likelihood estimator (i.e. $w_{\vartheta} = 1$ for all $\vartheta \in \Theta$). The theorem
generalizes to i.i.d. classes. For the example in Proposition 5, the instantaneous
bound implies that the bulk of losses occurs very late. This does not hold for
general (non-i.i.d.) model classes: The losses in [7, Example 9] grow linearly in
the first $n$ steps.

We will now state our main positive result that upper bounds the cumulative 
loss in terms of the negative logarithm of the true weight and the arrangement 
of the false parameters. We will only give the proof idea - which is similar to 
that of Proposition 5 - and omit the lengthy and tedious technical details. 

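Consider the cumulated sum square error $\sum_n E(\vartheta^{(\alpha,n)} - \vartheta_0)^2$. In order to
upper bound this quantity, we will partition the open unit interval (0, 1) into
a sequence of intervals $(I_k)_{k=1}^{\infty}$, each of measure $2^{-k}$. (More precisely: Each $I_k$
is either an interval or a union of two intervals.) Then we will estimate the
contribution of each interval to the cumulated square error,

$$C(k) = \sum_{n=1}^{\infty} \sum_{\alpha:\ \vartheta^{(\alpha,n)} \in I_k} p(\alpha|n)\,(\vartheta^{(\alpha,n)} - \vartheta_0)^2$$

(compare (3) and (5)). Note that $\vartheta^{(\alpha,n)} \in I_k$ precisely reads $\vartheta^{(\alpha,n)} \in I_k \cap \Theta$,
but for convenience we generally assume $\vartheta \in \Theta$ for all $\vartheta$ being considered. This
partitioning is also used for $\alpha$, i.e. define the contribution $C(k,j)$ of $\vartheta \in I_k$ where
$\alpha \in I_j$ as

$$C(k,j) = \sum_{n=1}^{\infty} \sum_{\alpha \in I_j:\ \vartheta^{(\alpha,n)} \in I_k} p(\alpha|n)\,(\vartheta^{(\alpha,n)} - \vartheta_0)^2.$$

We need to distinguish between $\alpha$ that are located close to $\vartheta_0$ and $\alpha$ that are
located far from $\vartheta_0$. "Close" will be roughly equivalent to $j > k$, "far" will be
approximately $j \le k$. So we get $\sum_n E(\vartheta^{(\alpha,n)} - \vartheta_0)^2 = \sum_k \sum_j C(k,j)$.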

In the proof,

$$p(\alpha|n) \stackrel{\times}{\le} [n\alpha(1-\alpha)]^{-\frac{1}{2}} \exp[-nD(\alpha\|\vartheta_0)]$$

is often applied, which holds by Lemma 3 (recall that $f \stackrel{\times}{\le} g$ stands for $f = O(g)$).
Terms like $D(\alpha\|\vartheta_0)$, arising in this context and others, can be further estimated
using Lemma 2. We now give the constructions of intervals $I_k$ and complementary
intervals $J_k$.




Fig. 1. Example of the first four intervals for $\vartheta_0 = \frac{3}{16}$. We have an l-step, a c-step,
an l-step and another c-step. All following steps will also be c-steps.



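Definition 7. Let $\vartheta_0 \in \Theta$ be given. Start with $J_0 = [0, 1)$. Let $J_{k-1} = [\vartheta_k^l, \vartheta_k^r)$
and define $d_k = \vartheta_k^r - \vartheta_k^l = 2^{-k+1}$. Then $I_k, J_k \subset J_{k-1}$ are constructed from $J_{k-1}$
according to the following rules.

$$\vartheta_0 \in [\vartheta_k^l, \vartheta_k^l + \tfrac{3}{8}d_k) \ \Rightarrow\ J_k = [\vartheta_k^l, \vartheta_k^l + \tfrac{1}{2}d_k),\quad I_k = [\vartheta_k^l + \tfrac{1}{2}d_k, \vartheta_k^r), \qquad (6)$$

$$\vartheta_0 \in [\vartheta_k^l + \tfrac{3}{8}d_k, \vartheta_k^l + \tfrac{5}{8}d_k) \ \Rightarrow\ J_k = [\vartheta_k^l + \tfrac{1}{4}d_k, \vartheta_k^l + \tfrac{3}{4}d_k),\quad I_k = [\vartheta_k^l, \vartheta_k^l + \tfrac{1}{4}d_k) \cup [\vartheta_k^l + \tfrac{3}{4}d_k, \vartheta_k^r), \qquad (7)$$

$$\vartheta_0 \in [\vartheta_k^l + \tfrac{5}{8}d_k, \vartheta_k^r) \ \Rightarrow\ J_k = [\vartheta_k^l + \tfrac{1}{2}d_k, \vartheta_k^r),\quad I_k = [\vartheta_k^l, \vartheta_k^l + \tfrac{1}{2}d_k). \qquad (8)$$

We call the $k$th step of the interval construction an l-step if (6) applies, a c-step
if (7) applies, and an r-step if (8) applies, respectively. Fig. 1 shows an example
for the interval construction.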
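The interval construction can be traced with exact rational arithmetic. The following is a minimal sketch, not the authors' code; it assumes that the case distinction in rules (6)-(8) is made at the thresholds 3/8 and 5/8 of the current interval length $d_k$, and the function name `interval_steps` is hypothetical.

```python
from fractions import Fraction


def interval_steps(theta0, num_steps):
    """Trace the interval construction for the true parameter theta0.

    Returns a list of (step_type, I_k, J_k) for k = 1..num_steps, where
    I_k is a list of half-open intervals (lo, hi) and J_k is one interval.
    Assumption: thresholds 3/8 and 5/8 of d_k decide between l-, c-, r-step.
    """
    lo, hi = Fraction(0), Fraction(1)  # J_0 = [0, 1)
    result = []
    for _ in range(num_steps):
        d = hi - lo  # d_k = 2^(-k+1)
        if theta0 < lo + Fraction(3, 8) * d:  # rule (6): l-step
            step = "l"
            J = (lo, lo + d / 2)
            I = [(lo + d / 2, hi)]
        elif theta0 < lo + Fraction(5, 8) * d:  # rule (7): c-step
            step = "c"
            J = (lo + d / 4, lo + 3 * d / 4)
            I = [(lo, lo + d / 4), (lo + 3 * d / 4, hi)]
        else:  # rule (8): r-step
            step = "r"
            J = (lo + d / 2, hi)
            I = [(lo, lo + d / 2)]
        result.append((step, I, J))
        lo, hi = J  # continue inside J_k
    return result


if __name__ == "__main__":
    for k, (step, I, J) in enumerate(interval_steps(Fraction(3, 16), 6), start=1):
        print(k, step, I, J)
```

In every step the pieces of $I_k$ have total measure $2^{-k}$ and $J_k$ still contains $\vartheta_0$, so the intervals $J_k$ contract exponentially to the true parameter.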

Clearly, this is not the only possible way to define an interval construction. 
Maybe the reader wonders why we did not center the intervals around $\vartheta_0$. In
fact, this construction would equally work for the proof. However, its definition
would not be easier, since one still has to treat the case where $\vartheta_0$ is located
close to the boundary. Moreover, our construction has the nice property that
the interval bounds are finite binary fractions. Given the interval construction,
we can identify the $\vartheta \in I_k$ with lowest complexity:

$$\vartheta_k^I = \arg\min\{Kw(\vartheta) : \vartheta \in I_k \cap \Theta\},$$

$$\vartheta_k^J = \arg\min\{Kw(\vartheta) : \vartheta \in J_k \cap \Theta\}, \text{ and}$$

$$\Delta(k) = \max\{Kw(\vartheta_k^I) - Kw(\vartheta_k^J), 0\}.$$

If there is no $\vartheta \in I_k \cap \Theta$, we set $\Delta(k) = Kw(\vartheta_k^I) = \infty$.

Theorem 8. Let $\Theta \subset [0,1]$ be countable, $\vartheta_0 \in \Theta$, and $w_\vartheta = 2^{-Kw(\vartheta)}$, where
$Kw(\vartheta)$ is some complexity measure on $\Theta$. Let $\Delta(k)$ be as introduced in the last
paragraph, then

$$\sum_{n=0}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} Kw(\vartheta_0) + \sum_{k=1}^{\infty} 2^{-\Delta(k)}\sqrt{\Delta(k)}.$$

The proof is omitted. But we briefly discuss the assertion of this theorem. It
states an error bound in terms of the arrangement of the false parameters which
directly depends on the interval construction. As already indicated, a different
interval construction would do as well, provided that it exponentially contracts
to the true parameter. For a reasonable distribution of parameters, we might
expect that $\Delta(k)$ increases linearly for $k$ large enough, and thus $\sum_k 2^{-\Delta(k)}\sqrt{\Delta(k)}$
remains bounded. In the next section, we identify cases where this holds.
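The boundedness claim is easy to check numerically. The sketch below assumes that the sum in Theorem 8 has the form $\sum_k 2^{-\Delta(k)}\sqrt{\Delta(k)}$ (the form that also appears in the proof of Theorem 9), and it uses the purely illustrative choice $\Delta(k) = k/a$; the function name `theorem8_sum` is hypothetical.

```python
import math


def theorem8_sum(delta, num_terms):
    """Partial sum of 2^(-delta(k)) * sqrt(delta(k)) for k = 1..num_terms."""
    return sum(
        2.0 ** -delta(k) * math.sqrt(delta(k)) for k in range(1, num_terms + 1)
    )


if __name__ == "__main__":
    for a in (1, 2, 4):
        # Illustrative assumption: Delta(k) grows linearly with slope 1/a.
        print(a, theorem8_sum(lambda k: k / a, 400))
```

With linear growth the terms decay geometrically, so the partial sums stabilise after a few dozen terms, and the limit grows roughly in proportion to $a$.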



5 Uniformly Distributed Weights 



We are now able to state some positive results following from Theorem 8. 

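Theorem 9. Let $\Theta \subset [0, 1]$ be a countable class of parameters and $\vartheta_0 \in \Theta$ the
true parameter. Assume that there are constants $a \ge 1$ and $b \ge 0$ such that

$$\min\{Kw(\vartheta) : \vartheta \in [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}] \cap \Theta,\ \vartheta \ne \vartheta_0\} \ge \frac{k-b}{a} \qquad (9)$$

holds for all $k > aKw(\vartheta_0) + b$. Then we have

$$\sum_{n=0}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} aKw(\vartheta_0) + b \stackrel{\times}{\le} Kw(\vartheta_0).$$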

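Proof. We have to show that

$$\sum_{k=1}^{\infty} 2^{-\Delta(k)}\sqrt{\Delta(k)} \stackrel{\times}{\le} aKw(\vartheta_0) + b,$$

then the assertion follows from Theorem 8. Let $k_1 = \lceil aKw(\vartheta_0) + b + 1 \rceil$ and
$k' = k - k_1$. It is not hard to see that $\max_{\vartheta \in I_k} |\vartheta - \vartheta_0| \le 2^{-k+1}$ holds. Together
with (9), this implies

$$\sum_{k=1}^{\infty} 2^{-\Delta(k)}\sqrt{\Delta(k)} \le \sum_{k=1}^{k_1} 1 + \sum_{k=k_1+1}^{\infty} 2^{-Kw(\vartheta_k^I) + Kw(\vartheta_k^J)} \sqrt{Kw(\vartheta_k^I) - Kw(\vartheta_k^J)}$$

$$\le k_1 + 2^{Kw(\vartheta_0)} \sum_{k=k_1+1}^{\infty} 2^{-Kw(\vartheta_k^I)} \sqrt{Kw(\vartheta_k^I)}$$

$$\stackrel{\times}{\le} k_1 + 2^{Kw(\vartheta_0)} \sum_{k=k_1+1}^{\infty} 2^{-\frac{k-b}{a}} \sqrt{\tfrac{k-b}{a}}$$

$$\stackrel{\times}{\le} aKw(\vartheta_0) + b + 2 + \sum_{k'=1}^{\infty} 2^{-\frac{k'}{a}} \sqrt{\tfrac{k'}{a} + Kw(\vartheta_0)}.$$

Observe $\sqrt{\frac{k'}{a} + Kw(\vartheta_0)} \le \sqrt{\frac{k'}{a}} + \sqrt{Kw(\vartheta_0)}$, $\sum_{k'} 2^{-\frac{k'}{a}}\sqrt{\frac{k'}{a}} \stackrel{\times}{\le} a$ by Lemma 4
(i), and $\sum_{k'} 2^{-\frac{k'}{a}} \stackrel{\times}{\le} a$. Then the assertion follows. □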
Letting $j = \frac{k-b}{a}$, (9) asserts that parameters with complexity $Kw(\vartheta) = j$
must have a minimum distance of $2^{-ja-b}$ from $\vartheta_0$. That is, if parameters with
equal weights are (approximately) uniformly distributed in the neighborhood of
$\vartheta_0$, in the sense that they are not too close to each other, then fast convergence
holds. The next two results are special cases based on the set of all finite binary
fractions,

$$\mathbb{Q}_{\mathbb{B}^*} = \{\vartheta = 0.\beta_1\beta_2\ldots\beta_{n-1}1 : n \in \mathbb{N}, \beta_i \in \mathbb{B}\} \cup \{0, 1\}.$$

If $\vartheta = 0.\beta_1\beta_2\ldots\beta_{n-1}1 \in \mathbb{Q}_{\mathbb{B}^*}$, its length is $\ell(\vartheta) = n$. Moreover, there is a
binary code $\beta'_1 \ldots \beta'_{n'}$ for $n$, having at most $n' \le \lfloor \log_2(n+1) \rfloor$ bits. Then
$0\beta'_1 0\beta'_2 \ldots 0\beta'_{n'} 1 \beta_1 \ldots \beta_{n-1}$ is a prefix-code for $\vartheta$. For completeness, we can
define the codes for $\vartheta = 0, 1$ to be 10 and 11, respectively. So we may define a
complexity measure on $\mathbb{Q}_{\mathbb{B}^*}$ by

$$Kw(0) = 2,\quad Kw(1) = 2,\quad \text{and}\quad Kw(\vartheta) = \ell(\vartheta) + 2\lfloor \log_2(\ell(\vartheta) + 1) \rfloor \text{ for } \vartheta \ne 0, 1. \qquad (10)$$
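The code behind (10) can be written down explicitly. The sketch below is an illustration under stated assumptions: a fraction $0.\beta_1\ldots\beta_{n-1}1$ is represented by the digit string after the binary point, and the length $n$ is encoded as the binary expansion of $n+1$ without its leading one, which is one way to obtain a code of $\lfloor\log_2(n+1)\rfloor$ bits. The helper names are hypothetical.

```python
from itertools import product


def length_code(n):
    """Binary code for n >= 1 with floor(log2(n + 1)) bits.

    Assumption: n is mapped to the binary expansion of n + 1 with the
    leading '1' removed (1 -> '0', 2 -> '1', 3 -> '00', ...).
    """
    return bin(n + 1)[3:]  # strip the '0b1' prefix


def prefix_code(digits):
    """Prefix code for 0.digits, where digits ends in '1'."""
    n = len(digits)
    doubled = "".join("0" + bit for bit in length_code(n))  # 0b'_1 0b'_2 ...
    return doubled + "1" + digits[:-1]  # the final digit 1 is implicit


def kw(digits):
    """Complexity Kw as the length of the prefix code, as in (10)."""
    return len(prefix_code(digits))


def all_codes(max_len):
    """Codes of 0, 1 and of every binary fraction of length <= max_len."""
    codes = ["10", "11"]
    for n in range(1, max_len + 1):
        for head in product("01", repeat=n - 1):
            codes.append(prefix_code("".join(head) + "1"))
    return codes


if __name__ == "__main__":
    print(prefix_code("1"), kw("1"))      # code and complexity of 1/2
    print(prefix_code("0011"), kw("0011"))  # code and complexity of 3/16
```

Since the doubled length code announces how many payload bits follow, no code word is a prefix of another, and the resulting weights $2^{-Kw(\vartheta)}$ satisfy the Kraft inequality.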

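There are other similar simple prefix codes on $\mathbb{Q}_{\mathbb{B}^*}$ such that $Kw(\vartheta) \ge \ell(\vartheta)$.

Corollary 10. Let $\Theta = \mathbb{Q}_{\mathbb{B}^*}$, $\vartheta_0 \in \Theta$ and $Kw(\vartheta) \ge \ell(\vartheta)$, then $\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} Kw(\vartheta_0)$ holds.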

The proof is trivial, since Condition (9) holds with $a = 1$ and $b = 0$. This is
a special case of a uniform distribution of parameters with equal complexities.
The next corollary is more general: it proves fast convergence if the uniform
distribution is distorted by some function $\varphi$.

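Corollary 11. Let $\varphi : [0, 1] \to [0, 1]$ be an injective, $N$ times continuously differentiable
function. Let $\Theta = \varphi(\mathbb{Q}_{\mathbb{B}^*})$, $Kw(\varphi(t)) \ge \ell(t)$ for all $t \in \mathbb{Q}_{\mathbb{B}^*}$, and
$\vartheta_0 = \varphi(t_0)$ for a $t_0 \in \mathbb{Q}_{\mathbb{B}^*}$. Assume that there is $n \le N$ and $\varepsilon > 0$ such that

$$\left|\frac{d^n\varphi}{dt^n}(t)\right| \ge c > 0 \text{ for all } t \in [t_0 - \varepsilon, t_0 + \varepsilon] \quad\text{and}\quad \frac{d^m\varphi}{dt^m}(t_0) = 0 \text{ for all } 1 \le m < n.$$

Then we have

$$\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} nKw(\vartheta_0) + 2\log_2(n!) - 2\log_2 c + n\log_2 \varepsilon \stackrel{\times}{\le} nKw(\vartheta_0).$$

Proof. Fix $j > Kw(\vartheta_0)$, then

$$Kw(\varphi(t)) \ge j \text{ for all } t \in [t_0 - 2^{-j}, t_0 + 2^{-j}] \cap \mathbb{Q}_{\mathbb{B}^*}. \qquad (11)$$

Moreover, for all $t \in [t_0 - 2^{-j}, t_0 + 2^{-j}]$, Taylor's theorem asserts that

$$\varphi(t) = \varphi(t_0) + \frac{\frac{d^n\varphi}{dt^n}(\tilde t)}{n!}(t - t_0)^n \qquad (12)$$

for some $\tilde t$ in $(t_0, t)$ (or $(t, t_0)$ if $t < t_0$). We request in addition $2^{-j} \le \varepsilon$, then
$|\frac{d^n\varphi}{dt^n}| \ge c$ by assumption. Apply (12) to $t = t_0 + 2^{-j}$ and $t = t_0 - 2^{-j}$ and
define $k = \lceil jn + \log_2(n!) - \log_2 c \rceil$ in order to obtain $|\varphi(t_0 + 2^{-j}) - \vartheta_0| \ge 2^{-k}$ and
$|\varphi(t_0 - 2^{-j}) - \vartheta_0| \ge 2^{-k}$. By injectivity of $\varphi$, we see that $\varphi(t) \notin [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}]$
if $t \notin [t_0 - 2^{-j}, t_0 + 2^{-j}]$. Together with (11), this implies

$$Kw(\vartheta) \ge j \ge \frac{k - \log_2(n!) + \log_2 c - 1}{n} \text{ for all } \vartheta \in [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}] \cap \Theta.$$

This is condition (9) with $a = n$ and $b = \log_2(n!) - \log_2 c + 1$. Finally, the
assumption $2^{-j} \le \varepsilon$ holds if $k \ge k_1 = n\log_2 \varepsilon + \log_2(n!) - \log_2 c + 1$. This gives
an additional contribution to the error of at most $k_1$. □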

Corollary 11 shows an implication of Theorem 8 for parameter identification:
A class of models is given by a set of parameters $\mathbb{Q}_{\mathbb{B}^*}$ and a mapping $\varphi : \mathbb{Q}_{\mathbb{B}^*} \to \Theta$.
The task is to identify the true parameter $t_0$ or its image $\vartheta_0 = \varphi(t_0)$. The
injectivity of $\varphi$ is not necessary for fast convergence, but it facilitates the proof.
The assumptions of Corollary 11 are satisfied if $\varphi$ is for example a polynomial. In
fact, it should be possible to prove fast convergence of MDL for many common
parameter identification problems. For sets of parameters other than $\mathbb{Q}_{\mathbb{B}^*}$, e.g.
the set of all rational numbers $\mathbb{Q}$, similar corollaries can easily be proven.

How large is the constant hidden in "$\stackrel{\times}{\le}$"? When examining carefully the
proof of Theorem 8, the resulting constant is quite large. This is mainly due to
the frequent "wasting" of small constants. Presumably a smaller bound holds as
well, perhaps 16. On the other hand, for the actual true expectation (as opposed
to its upper bound) and complexities as in (10), numerical simulations indicate
that $\sum_n E(\vartheta_0 - \vartheta^x)^2 \le \frac{1}{2}Kw(\vartheta_0)$.

Finally, we state an implication which almost trivially follows from Theorem
8, since there $\sum_k 2^{-\Delta(k)}\sqrt{\Delta(k)} \le N$ is obvious. However, it may be very useful
for practical purposes, e.g. for hypothesis testing.

Corollary 12. Let $\Theta$ contain $N$ elements, $Kw(\cdot)$ be any complexity function on
$\Theta$, and $\vartheta_0 \in \Theta$. Then we have

$$\sum_{n=1}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} N + Kw(\vartheta_0).$$

6 The Universal Case 

We briefly discuss the important universal setup, where $Kw(\cdot)$ is (up to an additive
constant) equal to the prefix Kolmogorov complexity $K$ (that is the length
of the shortest self-delimiting program printing $\vartheta$ on some universal Turing machine).
Since $\sum_k 2^{-K(k)}\sqrt{K(k)} = \infty$ no matter how late the sum starts (otherwise
there would be a shorter code for large $k$), we cannot apply Theorem 8. This
means in particular that we do not even obtain our previous result, Theorem 1.
But probably the following strengthening of the theorem holds under the same
conditions, which then easily implies Theorem 1 up to a constant.

Conjecture 13. $\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} K(\vartheta_0) + \sum_k 2^{-\Delta(k)}$.

Then, take an incompressible finite binary fraction $\vartheta_0 \in \mathbb{Q}_{\mathbb{B}^*}$, i.e. $K(\vartheta_0) =
\ell(\vartheta_0) + K(\ell(\vartheta_0))$. For $k > \ell(\vartheta_0)$, we can reconstruct $\vartheta_0$ and $k$ from $\vartheta_k^I$ and
$\ell(\vartheta_0)$ by just truncating $\vartheta_k^I$ after $\ell(\vartheta_0)$ bits. Thus $K(\vartheta_k^I) + K(\ell(\vartheta_0)) \ge K(\vartheta_0) +
K(k|\vartheta_0, K(\vartheta_0))$ holds. Using Conjecture 13, we obtain

$$\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} K(\vartheta_0) + 2^{K(\ell(\vartheta_0))} \stackrel{\times}{\le} \ell(\vartheta_0)\,(\log_2 \ell(\vartheta_0))^2, \qquad (13)$$

where the last inequality follows from the example coding given in (10). So,
under Conjecture 13, we obtain a bound which slightly exceeds the complexity
$K(\vartheta_0)$ if $\vartheta_0$ has a certain structure. It is not obvious if the same holds for all
computable $\vartheta_0$. In order to answer this question positively, one could try to use
something like [11, Eq.(2.1)]. This statement implies that as soon as $K(k) \ge K_1$
for all $k \ge k_1$, we have $\sum_{k \ge k_1} 2^{-K(k)} \stackrel{\times}{\le} 2^{-K_1} K_1 (\log_2 K_1)^2$. It is possible to
prove an analogous result for $\vartheta_k^I$ instead of $k$, however we have not found an
appropriate coding that does without knowing $\vartheta_0$. Since the resulting bound is
exponential in the code length, we therefore have not gained anything.

Another problem concerns the size of the multiplicative constant that is hidden
in the upper bound. Unlike in the case of uniformly distributed weights,
it is now of exponential size, i.e. $2^{O(1)}$. This is no artifact of the proof, as the
following example shows.

Example 14. Let $U$ be some universal Turing machine. We construct a second
universal Turing machine $U'$ from $U$ as follows: Let $N \ge 1$. If the input of $U'$ is
$1^N p$, where $1^N$ is the string consisting of $N$ ones and $p$ is some program, then
$U$ will be executed on $p$. If the input of $U'$ is $0^N$, then $U'$ outputs $\frac{1}{2}$. Otherwise,
if the input of $U'$ is $x$ with $x \in \mathbb{B}^N \setminus \{0^N, 1^N\}$, then $U'$ outputs $\frac{1}{2} + 2^{-x-1}$. For
$\vartheta_0 = \frac{1}{2}$, the conditions of a slight generalization of Proposition 5 are satisfied
(where the complexity is relative to $U'$), thus $\sum_n E(\vartheta^x - \vartheta_0)^2 \stackrel{\times}{\ge} 2^N$.

Can this also happen if the underlying universal Turing machine is not 
“strange” in some sense, like U' , but “natural”? Again this is not obvious. One 
would have to define first a “natural” universal Turing machine which rules out 
cases like $U'$. If $N$ is not too large, then one can even argue that $U'$ is natural
in the sense that its compiler constant relative to $U$ is small.

There is a relation to the class of all deterministic (generally non-i.i.d.) measures.
For this setup, MDL predicts the next symbol just according to the monotone
complexity $Km$, see [12]. According to [12, Theorem 5], $2^{-Km}$ is very close
to the universal semimeasure $M$ (this is due to [13]). Then the total prediction
error (which is defined slightly differently in this case) can be shown to be
bounded by $2^{O(1)} Km(x_{<\infty})^3$ [14]. The similarity to the (unproven) bound (13)
"huge constant × polynomial" for the universal Bernoulli case is evident.
7 Discussion and Conclusions

We have discovered the fact that the instantaneous and the cumulative loss
bounds can be incompatible. On the one hand, the cumulative loss for MDL predictions
may be exponential, i.e. $2^{Kw(\vartheta_0)}$. Thus it implies almost sure convergence
at a slow rate, even for arbitrary discrete model classes [7]. On the other hand,
the instantaneous loss is always of order $\frac{1}{n}Kw(\vartheta_0)$, implying fast convergence in
probability and a cumulative loss bound of $Kw(\vartheta_0)\ln n$. Similar logarithmic loss
bounds can be found in the literature for continuous model classes [2].

A different approach to assess convergence speed is presented in [4]. There,
an index of resolvability is introduced, which can be interpreted as the difference
of the expected MDL code length and the expected code length under the true
model. For discrete model classes, they show that the index of resolvability converges
to zero as $\frac{1}{n}Kw(\vartheta_0)$ [4, Equation (6.2)]. Moreover, they give a convergence
of the predictive distributions in terms of the Hellinger distance [4, Theorem 4].
This implies a cumulative (Hellinger) loss bound of $Kw(\vartheta_0)\ln n$ and therefore
fast convergence in probability.

If the prior weights are arranged nicely, we have proven a small finite loss
bound $Kw(\vartheta_0)$ for MDL (Theorem 8). If parameters of equal complexity are
uniformly distributed or not too strongly distorted (Theorem 9 and Corollaries),
then the error is within a small multiplicative constant of the complexity $Kw(\vartheta_0)$.
This may be applied e.g. for the case of parameter identification (Corollary 11).
A similar result holds if $\Theta$ is finite and contains only few parameters (Corollary
12), which may be e.g. satisfied for hypothesis testing. In these cases and many
others, one can interpret the conditions for fast convergence as the presence of 
prior knowledge. One can show that if a predictor converges to the correct model, 
then it performs also well under arbitrarily chosen bounded loss-functions [15, 
Theorem 4]. Moreover, we can then conclude good properties for other machine 
learning tasks such as classification, as discussed in the introduction. From an 
information theoretic viewpoint one may interpret the conditions for a small 
bound in Theorem 8 as “good codes” . 

The main restriction of our positive result is the fact that we have proved it
only for the Bernoulli case. We therefore argue that it generalizes to arbitrary
i.i.d. settings. Let $\vartheta_0 \in [0, 1]^N$, $\sum_i \vartheta_0^i = 1$ be a probability vector that generates
sequences of i.i.d. samples in $\{1, \ldots, N\}^\infty$. Assume that $\vartheta_0$ stays away from the
boundary (the other case is treated similarly). Then we can define a sequence of
nested sets in dimension $N - 1$ in analogy to the interval construction. The main
points of the proof are now the following two: First, for an observed parameter
$\alpha$ far from $\vartheta_0$, the probability of $\alpha$ decays exponentially, and second, for $\alpha$ close
to $\vartheta_0$, some $\vartheta$ far from $\vartheta_0$ can contribute at most for a short time. These facts hold
in the general i.i.d. case like in the Bernoulli case. However, the rigorous proof of
it is yet more complicated and technical than for the Bernoulli case. (Compare
the proof of the main result in [2].)

We conclude with an open question. In abstract terms, we have proven a convergence
result for the Bernoulli (or i.i.d.) case by mainly exploiting the geometry
of the space of distributions. This is in principle very easy, since for Bernoulli
this space is just the unit interval, for i.i.d. it is the space of probability vectors.
It is not obvious how (or if at all) this approach can be transferred to general
(computable) measures.
Abstract. We consider a two-layer network algorithm. The first layer
consists of an uncountable number of linear units. Each linear unit is an 
LMS algorithm whose inputs are first “kernelized.” Each unit is indexed 
by the value of a parameter corresponding to a parameterized reproduc- 
ing kernel. The first-layer outputs are then connected to an exponential 
weights algorithm which combines them to produce the final output. We 
give loss bounds for this algorithm; and for specific applications to pre- 
diction relative to the best convex combination of kernels, and the best 
width of a Gaussian kernel. The algorithm’s predictions require the com- 
putation of an expectation which is a quotient of integrals as seen in 
a variety of Bayesian inference problems. Typically this computational 
problem is tackled by mcmc, importance sampling, and other sampling 
techniques for which there are few polynomial time guarantees of the 
quality of the approximation in general and none for our problem specif- 
ically. We develop a novel deterministic polynomial time approximation 
scheme for the computations of expectations considered in this paper. 



1 Introduction 

We give performance guarantees and a tractable method of computation for the 
two-layer network algorithm k-lms-net. The performance guarantees measure 
online performance in a non-statistical learning framework introduced by Littlestone
[13,14]. Here, learning proceeds in trials $t = 1, 2, \ldots, \ell$. In each trial $t$ the
algorithm receives a pattern $\mathbf{x}_t$. It then gives a prediction denoted $\hat{y}_t \in \mathbb{R}$. The
algorithm then receives an outcome $y_t \in \mathbb{R}$, and incurs a loss $L(y_t, \hat{y}_t)$ measuring
the discrepancy between $y_t$ and $\hat{y}_t$; in this paper $L(y_t, \hat{y}_t) = (y_t - \hat{y}_t)^2$. A relative
loss bound performance guarantee bounds the cumulative loss of the algorithm
with the cumulative loss of any member $c : X \to \mathbb{R}$ of a comparison class $C$ of
predictors plus an additional term. These bounds are of the following form, for
all data sequences $S = ((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell))$,

$$\sum_{t=1}^{\ell} L(y_t, \hat{y}_t) \le \sum_{t=1}^{\ell} L(y_t, c(\mathbf{x}_t)) + O(r(S, C, c)) \quad \forall c \in C$$

where $r(S, C, c)$ is known as the regret, since it measures our "regret" at using
our algorithm versus the “best” predictor c in the comparison class. In the ideal 
case the regret is a slowly growing function of the data sequence, the comparison 
class, and the particular predictor. Surprisingly, such bounds are possible without 
probabilistic assumptions on the sequence of examples. 

The architecture of the k-lms-net algorithm is a simple chaining of two well- 
known online algorithms. The first layer consists of an uncountable number of 
linear units. Each unit is an lms algorithm [4] whose inputs are first “kernelized.” 
Each unit is indexed by the value of a parameter ($\alpha \in [0, 1]$) corresponding to
a parameterized reproducing kernel [2], for example a Gaussian kernel and its
width $k_\alpha(\mathbf{v}, \mathbf{w}) = \exp(-s_0\alpha\|\mathbf{v} - \mathbf{w}\|^2)$. The first-layer outputs are then directed to an

exponential weights algorithm [20,14,12] which combines them to produce the 
final output (prediction). This topology gives an algorithm whose comparison 
class (hypothesis space) is a union of linear spaces of functions. 

The results of this research are twofold. First, we give a general bound for 
the K-LMS-NET algorithm. This general bound is then applied to the problem 
of predicting almost as well as any function i) from the space defined by the 
“best” convex combination of two kernels and ii) from the space defined by the 
“best” width of an isotropic Gaussian kernel. Second, though the second layer 
combines an uncountable number of outputs from the first layer we show that the 
final prediction may be well-approximated in polynomial time. This prediction is 
an expectation (5) whose form is a quotient of integrals which does not have an 
analytic closed form; thus we resort to a novel sampling scheme. The significance 
of our sampler is that it produces a provably polynomial time approximation to 
our predictions. The sampler is deterministic, and relies on finding the critical 
points of the functions to be integrated; this leads to a limitation on the types 
of parameterized kernels for which we can give predictions in polynomial time. 
The applications for which we give bounds are among those whose predictions 
may be approximated in polynomial time by our sampling scheme. 

1.1 Related Work 

In [7] Freund applied the exponential weights algorithm to predicting as well as 
the best “biased coin” that modeled a data sequence, with an uncountable set of 
predictors each corresponding to a probability of “heads.” In [21] an algorithm 
similar to ridge regression was given, where the set of predictors corresponded 
to each linear function on $\mathbb{R}^n$; these were then combined with an exponential
weights algorithm. For those algorithms exact computation of the prediction 
was possible; exact computation, however, is not possible for the k-lms-net 
algorithm. 

In [4], relative loss bounds are proven for the classical lms algorithm; the 
bounds naturally apply to kernel lms. Some recent relative- loss bounds for vari- 
ants of kernel lms have appeared in [11,9]. In contrast, our kernel function has 
a free parameter; hence our comparison class (hypothesis space) is not a single 
kernel space, but a union of kernel spaces. 

The problem of learning the best parameters of a kernel function has been 
modeled as a regularized optimization problem in [16,15,23]. Methods based on 
Gaussian process regression have proven to be practical for learning or predicting
with a mixture of kernel parameterizations, for which we cite only a few of the 
many offline [22,8] and online algorithms [6,19] developed. The parameterized 
kernel function now corresponds to a parameterized covariance function. A key 
difference in focus is that our free parameter is a one-dimensional scalar, whereas 
in Gaussian process regression the free parameter vector is often in the hundreds 
of dimensions. The one-dimensional case we consider is certainly much simpler 
than the multidimensional case. However, we make no statistical assumptions 
on the data generation process, we give non-asymptotic relative loss bounds and 
we observe that even in the “simple” one-dimensional case it is not obvious how 
to sample so that predictions of guaranteed accuracy are produced in polynomial 
time. 

The predictions (5) of the k-lms-net algorithm are of the following (simplified)
form,

$$\hat{y}_t = \frac{\int_0^1 \hat{y}_t(\alpha) \exp(-L_{[1,t]}(\alpha))\,d\alpha}{\int_0^1 \exp(-L_{[1,t]}(\alpha))\,d\alpha}; \qquad (1)$$

here $\hat{y}_t(\alpha)$ and $L_{[1,t]}(\alpha)$ are the outputs and cumulative losses, respectively,
of each of the kernel lms algorithms at time $t$. In the applications to be discussed
$\hat{y}_t(\alpha)$ and $L_{[1,t]}(\alpha)$ reduce to either polynomials or to polynomials after a
change in variables. The problem of estimating such expectations is common in
Bayesian statistics. Since we cannot expect to compute $\hat{y}_t$ exactly, we consider
that a good polynomial time approximation scheme for $\hat{y}_t$ should have the following
property: for every $\epsilon \in (0, 1)$, an absolute error approximation $\tilde{y}_t$ should
satisfy $|\hat{y}_t - \tilde{y}_t| \le \epsilon$ and be computable in time polynomial in $O(\frac{1}{\epsilon})$. It is also
natural to extend the previous to randomized approximation schemes; however
the scheme we produce is fully deterministic. Hoeffding bounds [10] allow
one to produce an absolute error approximation for the Monte-Carlo integration
of $\int f\,d\mu$, with $\tilde{y} = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$, where $x_i$ is sampled from the probability
measure $\mu$, and $\epsilon = O(\frac{1}{\sqrt{n}})$. Hoeffding bounds are not applicable to the approximation
of (1) as it is a nontrivial problem in itself to produce samples from
the distribution $\exp(-L_{[1,t]}(\alpha)) / \int_0^1 \exp(-L_{[1,t]}(\alpha))\,d\alpha$ in polynomial time; nor can we apply

Hoeffding bounds individually to the integrals in the numerator and the denom- 
inator, as absolute error bounds do not “divide” naturally. A variety of other 
bounds have been proven for numeric integration within the information-based 
complexity framework [18]; however, as these are absolute error bounds they are 
likewise not applicable. Another approach to the approximation of equations of 
the form (1) which has proven useful in practice for similar applications is to 
use one of the many variants of MCMC sampling [1]. We are not aware of any 
bounds for MCMC sampling methods which give a polynomial time guarantee 
for randomized approximation to yt which are applicable to this research. Our 
sampling methodology is discussed in Sect. 3; the key to our method is to pro- 
duce a sampler for 1-d integrals with a provable relative error approximation, 
which unlike the absolute error approximation, is, loosely speaking, closed under 
division. 
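The remark that relative error is "closed under division" while absolute error is not can be made concrete with a short calculation. The sketch below is an illustration only, assuming a positive numerator and denominator; the function names are hypothetical.

```python
def quotient_relative_error(eps):
    """Worst-case relative error of num/den when both positive quantities
    are known up to relative error eps, with 0 < eps < 1.

    The quotient of the approximations lies between (1-eps)/(1+eps) and
    (1+eps)/(1-eps) times the true quotient, whatever num and den are.
    """
    if not 0 < eps < 1:
        raise ValueError("eps must lie in (0, 1)")
    lower = (1 - eps) / (1 + eps)
    upper = (1 + eps) / (1 - eps)
    return max(1 - lower, upper - 1)


def quotient_absolute_error(num, den, eps):
    """Worst-case absolute error of num/den when num and den are each
    known only up to absolute error eps (requires den > eps > 0)."""
    exact = num / den
    return max(
        abs((num + eps) / (den - eps) - exact),
        abs((num - eps) / (den + eps) - exact),
    )


if __name__ == "__main__":
    # Relative case: the error of the quotient depends on eps alone.
    print(quotient_relative_error(0.1))
    # Absolute case: with small integrals the same eps is useless.
    print(quotient_absolute_error(1e-3, 2e-3, 5e-4))
```

The relative bound $2\epsilon/(1-\epsilon)$ holds uniformly, whereas the absolute bound degrades as the denominator integral shrinks, which is exactly the regime of exponentially small weights.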
1.2 Preliminaries

The symbol $X$ denotes an abstract space, for example $X$ could be a set of strings.
Given a vector space $(V; +)$, the sum of two subsets $F$ and $G$ of $V$ is defined by
$F + G = \{f + g : f \in F, g \in G\}$. A Hilbert space $\mathcal{H}$ denotes a complete inner
product space. The inner product between vectors $\mathbf{v}$ and $\mathbf{w}$ in $\mathcal{H}$ is denoted
by $\langle \mathbf{v}, \mathbf{w} \rangle$ and the norm by $\|\mathbf{v}\|$. In this paper, we will consider Hilbert spaces
determined by a reproducing kernel $k : X \times X \to \mathbb{R}$. The prehilbert space induced
by kernel $k$ is the set $H_k = \mathrm{span}(\{k(x, \cdot)\}_{\forall x \in X})$ and the inner product of $f =
\sum_{i=1}^{m} \alpha_i k(x_i, \cdot)$ and $g = \sum_{j=1}^{n} \beta_j k(x'_j, \cdot)$ is $\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_i \beta_j k(x_i, x'_j)$. The
completion of $H_k$ is denoted $\mathcal{H}_k$. Two kernels $k_0 : X \times X \to \mathbb{R}$ and $k_1 : X' \times X' \to \mathbb{R}$
are termed domain compatible if $X = X'$. The reproducing property of the kernel
is that given any $f \in \mathcal{H}_k$ and any $x \in X$ then $f(x) = \langle f(\cdot), k(x, \cdot) \rangle$; other useful
properties of reproducing kernels, and introductory material may be found in
[5]. In this paper we are particularly interested in parameterized kernels $k_\alpha$ with
an associated Hilbert space $\mathcal{H}_\alpha$, inner product $\langle \cdot, \cdot \rangle_\alpha$, and norm $\|\cdot\|_\alpha$, for every
$\alpha \in [0, 1]$. We denote the Lebesgue measure of a set $A$ by $\mu(A)$.
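For finite kernel expansions, the prehilbert inner product and the reproducing property can be evaluated directly. The following sketch is only an illustration on the real line; the Gaussian kernel with scale `s` and all function names are assumptions made for the example.

```python
import math


def gaussian_kernel(v, w, s=1.0):
    """Gaussian kernel on the real line (illustrative choice of kernel)."""
    return math.exp(-s * (v - w) ** 2)


def inner_product(alpha, xs, beta, zs, k):
    """<f, g> for f = sum_i alpha_i k(x_i, .) and g = sum_j beta_j k(z_j, .)."""
    return sum(
        a * b * k(x, z) for a, x in zip(alpha, xs) for b, z in zip(beta, zs)
    )


def evaluate(alpha, xs, x, k):
    """f(x) for f = sum_i alpha_i k(x_i, .)."""
    return sum(a * k(xi, x) for a, xi in zip(alpha, xs))


if __name__ == "__main__":
    alpha, xs = [1.0, -2.0], [0.0, 1.0]
    x = 0.3
    # Reproducing property: f(x) = <f, k(x, .)>.
    print(evaluate(alpha, xs, x, gaussian_kernel))
    print(inner_product(alpha, xs, [1.0], [x], gaussian_kernel))
    # Squared norm of f in the prehilbert space.
    print(inner_product(alpha, xs, alpha, xs, gaussian_kernel))
```

The first two printed values agree, since $k(x, \cdot)$ is itself an expansion with a single coefficient 1 at the point $x$.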

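An absolute $\epsilon$-approximation of $y \in \mathbb{R}$ by $\tilde{y} \in \mathbb{R}$ satisfies

$$|y - \tilde{y}| \le \epsilon \qquad (2)$$

denoted by $\tilde{y} \stackrel{a}{\approx}_\epsilon y$. A relative $\epsilon$-approximation of $y \in \mathbb{R}^+$ by $\tilde{y} \in \mathbb{R}^+$ satisfies

$$(1 - \epsilon)y \le \tilde{y} \le (1 + \epsilon)y, \qquad (3)$$

which is denoted by $\tilde{y} \stackrel{r}{\approx}_\epsilon y$. A polynomial $\epsilon$-approximation scheme requires for
each $\epsilon \in (0, 1)$ that we can compute $\tilde{y}$ s.t. $\tilde{y} \approx_\epsilon y$ in time polynomial in $O(\frac{1}{\epsilon})$. For simplicity,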
we describe the time complexity of our algorithms in terms of a naive real- 
valued model of computation, where arithmetic operations on real numbers, 
e.g., addition, exponentiation, kernel evaluation, etc., all require 0(1) “steps.” 



2 The K-LMS-NET Algorithm 

The following general bound for the k-lms-net algorithm is applied to predict- 
ing as well as the convex combination of two kernels and predicting as well as 
the best width of a Gaussian kernel over a discretized domain. 

Theorem 1. The k-lms-net algorithm with parameterized kernel function $k_\alpha$
($\alpha \in [0, 1]$) with any data sequence $((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell)) \in (X, [r_1, r_2])^\ell$
when the algorithm is tuned with constants $r_1$, $r_2$, and $\eta$, the total square loss of
the algorithm will satisfy



i re 

t=i 




n)^ In 



1 

y{A) 

(8) 



for all measurable sets A C [0, 1] for all tuples of functions {ha)aeA G Y\a(^A'^°‘ 
and for all constants and,X^\^, where for all a G A the following four 

conditions must hold: Et=i-^(yt) ^ ^Aj ll^alla ^ Vt : ka{xt,xt) < 



A^, and 77 = [1 -|- 



HaXa 



)X\] 



21-1 
Parameters: $X$: a pattern space;
$k_\alpha : X \times X \to \mathbb{R}$: a parameterized kernel function ($\alpha \in [0, 1]$);
$\{\mathcal{H}_\alpha\}$: a set of Hilbert spaces induced by $k_\alpha$;
$\eta$: a learning rate; $[r_1, r_2]$: an outcome range.
Data: An online sequence $((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell)) \in (X, [r_1, r_2])^\ell$.
Initialization: r = (t2 — ri), 'Wa i{x) — 0, wf (a) = 1, 

^*(a:) = max(ri, min(r 2 , a;)) ; ^T{w) = max(exp(— |), ui). 
for t = 1, . . . , £ do 

Predict: receive Xt, 

t-i 

yl{a) = =v^(yj -y){a))ka{xj,xt) (4) 

^ — fi m w 

Jo w“(a)do; 

Update: receive yt, 

^a,t+i{x) = Wa^t(x) + v{yt - yl{a))ka{xt, x) (6) 

L[i,t\{a) = L[i,t_ij(a) -f {yt - yl{a)f 

wr+i(a) =<^r(exp(^L[i,t](a))) (7) 

end 

Algorithm 1: k-lms-net algorithm 
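A single first-layer unit of Algorithm 1 can be sketched as follows. This is a minimal illustration of the kernel LMS update (6) and of the prediction as a kernel expansion, for one fixed value of the parameter $\alpha$; it deliberately omits the clipping functions and the second-layer exponential weights, whose constants are not reproduced here. The class name `KernelLMS` and the Gaussian kernel are assumptions made for the example.

```python
import math


def gaussian_kernel(v, w, s=1.0):
    """Gaussian kernel on the real line (illustrative choice)."""
    return math.exp(-s * (v - w) ** 2)


class KernelLMS:
    """One kernel LMS unit: the weight function is a kernel expansion."""

    def __init__(self, kernel, eta):
        self.kernel = kernel
        self.eta = eta      # learning rate
        self.points = []    # patterns x_j seen so far
        self.coeffs = []    # eta * (y_j - prediction_j)

    def predict(self, x):
        # Prediction = sum_j eta * (y_j - yhat_j) * k(x_j, x); clipping omitted.
        return sum(c * self.kernel(p, x) for c, p in zip(self.coeffs, self.points))

    def update(self, x, y):
        # Update (6): w_{t+1}(.) = w_t(.) + eta * (y_t - yhat_t) * k(x_t, .)
        y_hat = self.predict(x)
        self.points.append(x)
        self.coeffs.append(self.eta * (y - y_hat))
        return (y - y_hat) ** 2  # square loss incurred in this trial


if __name__ == "__main__":
    unit = KernelLMS(gaussian_kernel, eta=0.5)
    cumulative_loss = 0.0
    for x, y in [(0.0, 2.0), (0.0, 2.0), (1.0, -1.0)]:
        cumulative_loss += unit.update(x, y)
    print(cumulative_loss, unit.predict(0.0))
```

In the full algorithm one such unit exists for every $\alpha \in [0, 1]$, and the per-unit cumulative losses returned by `update` are what the exponential weights layer consumes.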



The bound is a straightforward chaining of the well known loss bounds [4] of the 
LMS (GD) algorithm and a variant of the exponential weights algorithm [20,14, 
12] that implements direct clipping of the inputs to guarantee a loss bound and 
an amortized clipping of the cumulative loss to enable efficient sampling. 

The generic bound given is neither a pure relative loss bound nor does it 
give an indication of whether the k-lms-net is polynomially tractable for a 
particular parameterized kernel. The bound is not a “pure” relative loss bound 
insofar as the regret (the final term of (8)) for any particular predictor is infinite
(since $A$ is then a point set, thus $\ln\frac{1}{\mu(A)} = \infty$). A pure relative loss bound may be
given if we can determine how the loss and the norm of a particular predictor
in $\mathcal{H}_{\alpha'}$ is related to near comparable predictors in $\mathcal{H}_{\alpha''}$ when $|\alpha' - \alpha''|$ is small.
In the following we “flesh out” the generic bound of Theorem 1 by giving pure 
relative loss bounds for two particular parameterized kernels; then in Sect. 3 we 
sketch how the prediction with these kernels is computable by a polynomial-time 
approximation scheme. 



2.1 Applications to Specific Parameterized Kernels 

Relative loss bounds are given in Theorems 4 and 5 for a parameterized kernel 
which is a parameterized convex combination of kernels and for a Gaussian 
kernel with a parameterized width, respectively. Each of these bounds is given
in terms of adjunct norms $C(\cdot)$ and $S(\cdot)$ on the kernel spaces rather than the
norms inherited from the underlying kernel space. The norms $C(\cdot)$ and $S(\cdot)$ are
tighter and weaker, respectively, than their inherited norm $\|\cdot\|_\alpha$. The proofs of
the theorems follow directly from Theorem 1 in conjunction with Lemmas 5
and 6 which appear in Appendix A.

Predicting Almost as Well as the Best Convex Combination of Two
Kernels. We consider the convex combination of two domain-compatible kernels.
Hence our parameterized kernel function is $k_\alpha = (1 - \alpha)k_0 + \alpha k_1$ where $k_0$ and
$k_1$ are two distinct kernel functions. We further require in this abstract that the
corresponding Hilbert spaces $\mathcal{H}_0$ and $\mathcal{H}_1$ be disjoint except for the zero function,
i.e., $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$. Typical kernel spaces that are disjoint except for the zero function
include spaces derived from polynomial kernels of differing degree and wavelet
kernels at two distinct levels of resolution. The following useful theorem from
Aronszajn [2] gives a basis for our following observations.

Theorem 2 ([2]). If $k_i$ are the domain-compatible kernels of Hilbert spaces $\mathcal{H}_i$
with the norms $\|\cdot\|_i$, then $k = k_0 + k_1$ is the kernel of Hilbert space $\mathcal{H} = \mathcal{H}_0 + \mathcal{H}_1$
of all functions $f = f_0 + f_1$ with $f_i \in \mathcal{H}_i$, and with the norm defined by

$$\|f\|^2 = \inf\left[\|f_0\|_0^2 + \|f_1\|_1^2\right]$$

which is the infimum taken for all decompositions $f = f_0 + f_1$ with $f_i \in \mathcal{H}_i$.

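Therefore given $\alpha', \alpha'' \in (0, 1)$ the three sets $\mathcal{H}_{\alpha'}$, $\mathcal{H}_{\alpha''}$, and $\bigcup_{\alpha \in [0,1]} \mathcal{H}_\alpha$ contain
exactly the same functions; however in general $\|f\|_{\alpha'} \ne \|f\|_{\alpha''}$. Observe that
with the assumption $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$ any function $f \in \bigcup_{\alpha \in [0,1]} \mathcal{H}_\alpha$ has a unique
decomposition $f = f_0 + f_1$ with $f_i \in \mathcal{H}_i$. Given the decomposition we can
compute the norm of $f$ in any particular $\mathcal{H}_\alpha$ via

$$\|f\|_\alpha^2 = \frac{1}{1 - \alpha}\|f_0\|_0^2 + \frac{1}{\alpha}\|f_1\|_1^2, \quad \alpha \in (0, 1), \qquad (9)$$

since for a scaled kernel $k' = \beta k$ the norm is rescaled as $\|f\|_{k'}^2 = \frac{1}{\beta}\|f\|_k^2$. We
may define the following norm over $\mathcal{H}_0 + \mathcal{H}_1$.

Definition 1. Given domain-compatible kernels $k_0$ and $k_1$ such that $\mathcal{H}_0 \cap \mathcal{H}_1 =
\{0\}$ let $k_\alpha = (1 - \alpha)k_0 + \alpha k_1$ then given $f \in \mathcal{H}_0 + \mathcal{H}_1$. Define $C(f)$ by

$$C^2(f) = \inf_{\alpha \in [0,1]} \|f\|_\alpha^2. \qquad (10)$$

The following theorem gives a canonical form for $C(f)$.

Theorem 3. Given $f = f_0 + f_1$ such that $f_i \in \mathcal{H}_i$ and $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$ then

$$C^2(f) = (\|f_0\|_0 + \|f_1\|_1)^2. \qquad (11)$$

Proof. The theorem immediately follows from the substitution of the minimizer
$\alpha = \frac{\|f_1\|_1}{\|f_0\|_0 + \|f_1\|_1}$. □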
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A recent generalization of this canonical form is given in [15, Lemma A. 2]. 

Theorem 4. Given the k-lms-net algorithm tuned with learning rate $\eta$, an
outcome range $[r_1, r_2]$, a parameterized kernel
$k_\alpha = (1-\alpha)k_0 + \alpha k_1$ constructed from two domain-compatible
kernels $k_0$ and $k_1$ such that $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$,
and a data sequence
$((x_1,y_1),(x_2,y_2),\ldots,(x_\ell,y_\ell)) \in (\mathcal{X},[r_1,r_2])^\ell$,
then the total loss of the algorithm satisfies
e i 

J2L{yt, yt) < Y^Liyt, h{xt)) + 2 \/IhX + 

-\- 2 (r 2 — ri)^ max(2 InClft.) -I- In — h In -,ln4) (12) 

c 3 

for all h G Ho + Hi and all constants L,H,X, and c G (0,1] such that the 

following four conditions hold: 77 = [(1 -I- ^^)]~^> o,nd 
e 

{h) c < , ’S^L{yt,ha{'x.t)) < L, and sup ka{xt,xt) < X^ ■ 

(13) 

Predicting Almost as Well as the Best Width of a Gaussian Kernel.

In the following we define the surfeit of a function. In this abstract we avoid
the technicalities of defining the surfeit for the complete Hilbert space
$\mathcal{H}_k$; we consider the definition only on the prehilbert space $H_k$.

Definition 2. Given a positive kernel
($\forall x,y \in \mathcal{X} : k(x,y) > 0$), let $f \in H_k$; then define the
surfeit by

$$S^2(f) = \inf \left[ \|f^+\|^2 + \|f^-\|^2 \right] . \tag{14}$$

The infimum is taken over all decompositions $f^+ + f^- = f$, where
$f^+ = \sum_{\beta_i > 0} \beta_i k(x_i,\cdot)$ and
$f^- = \sum_{\beta_i < 0} \beta_i k(x_i,\cdot)$ are a positive linear and a
negative linear combination of kernel functions, respectively, such that
$f = f^+ + f^- = \sum_i \beta_i k(x_i,\cdot)$.

The infimum exists since $0 \le \|f\|^2 \le S^2(f)$.

Theorem 5. Given the k-lms-NET algorithm with learning rate rj, an outcome 
range [ri,r 2 ] with max(|ri|, |r 2 |) > 1, a parameterized (a G [0, 1]^ Gaussian ker- 
nel, fca(vi,V 2 ) = exp(— socr||vi — V 2 |l^) with fixed scale constant sq > 1 over 
the domain [xi,a: 2 ]" x [xi,X 2 \^ with associated prehilbert spaces Ha a data se- 
quence ((xi,2/i), (x 2 ,i/ 2 ), • ■ • ,(x£, 2 /^)) G ([xi,a: 2 ]”, [ri,r 2 ])^, and the constants 

c G (0, 1], So > 1, A > 0, H > 1 -\- c and with 77 = [(1 -I- '^)]~^ then the total 
loss of the algorithm satisfies 

i t 

'^L{yt,yt) < '^L{yt,ha{y:t))+‘2HZ.H + H'^ + c-\-2{r2-rif[ln£-\-2lnS{ha) 

+ In 5 o + Inn + 21 n(x 2 — Xi) + lnmax(|ri|, |r 2 |) + In - + In 5] (15) 

c 
for all ha G Uae[o,i]^« that \\ha\\a + c< H'^, ^a(xt)) + c < L, 



and a G 




c 

5so^max(|ri|, |r 2 |)n(x 2 — xxYS‘^{ha) 



(16) 



The previous bound is given without regard of the computability of the pre- 
dictions. If we restrict the data sequence to a discretization of the unit interval 
we can then apply the methods of Sect. 3 (in particular see Claim 2) to obtain 
polynomially tractable approximate predictions. 



3 Computing the Predictions of k-lms-net 

In the previous section we gave both a generic bound and bounds for two specific
applications of the k-lms-net algorithm. Here we consider how the predictions
may be computed. Rather than computing the predictions exactly, we give a
polynomial-time absolute $\epsilon$-approximation scheme for the predictions
(see (5)) of the k-lms-net algorithm. The scheme is a deterministic sampling
algorithm that separately approximates the numerator and denominator of the
quotient of integrals that define the predictions of the k-lms-net algorithm.

The sampling methodology builds on the following three ideas. First, a
relative $\epsilon$-approximation of each integral automatically gives a
relative approximation of a quotient of integrals (cf. (5)), since relative
error approximations (aka “significant digits”) are closed under division. A
good relative error approximation is also a good absolute error approximation
up to a magnitude scaling constant. Second, to minimize the number of required
samples we must concentrate samples in areas of large magnitude. Third, since
the areas of large magnitude are co-determined by the critical points of the
function to be approximated, the inspection of the analytic form gives both a
bound on the number of critical points and a method to find the critical
points (areas of large magnitude).
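The closure of relative error under division can be made concrete with a small calculation. The sketch below assumes that a relative $\epsilon$-approximation $\hat{a}$ of a positive quantity $a$ means $|\hat{a} - a| \le \epsilon a$ (the formal definition lies outside this section, so this reading is an assumption), and the function name `quotient_relative_error` is illustrative. It computes the worst-case relative error of a quotient of two such approximations.

```python
def quotient_relative_error(eps):
    """Worst-case relative error of a_hat / b_hat when a_hat and b_hat are each
    within a factor (1 +/- eps) of positive quantities a and b."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    upper = (1.0 + eps) / (1.0 - eps) - 1.0   # a_hat too large, b_hat too small
    lower = 1.0 - (1.0 - eps) / (1.0 + eps)   # a_hat too small, b_hat too large
    return max(upper, lower)


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.3):
        print(eps, quotient_relative_error(eps), 3 * eps)
```

Under this assumed definition the quotient loses at most a factor of three in accuracy as long as $\epsilon \le 1/3$, because $(1+\epsilon)/(1-\epsilon) \le 1 + 3\epsilon$ on that range.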

In the following we first consider the cost in terms of relative loss bounds for 
using approximate predictions. We then consider the analytic forms of the func- 
tions to be integrated, both in general and then for our particular applications. 
Finally we give the details of our sampling methodology. 



3.1 Additional Regret for Approximate Predictions 

Rather than fixing the absolute accuracy $\epsilon$ of our approximate
predictions, we advocate using a schedule $\{\epsilon_t\}$. In the following we
see that a schedule which gradually increases the accuracy of our approximate
predictions allows the additional cumulative regret incurred to be bounded by
an $O(1)$ term.

When an exact prediction of the k-lms-net algorithm is replaced by an absolute
$\epsilon_t$-approximate prediction on trial $t$, the additional “approximation
regret” incurred on that trial may be bounded by
$2\epsilon_t|y_t - \hat{y}_t| + \epsilon_t^2$. Therefore we may bound the
additional cumulative regret for using approximate predictions by
$c_0 \in (0,1)$ with a decreasing schedule for $\{\epsilon_t\}$ of
$\epsilon_t = \frac{c_0}{6.5\max(r_2 - r_1,\, 1)\,(t+1)\ln^2(t+1)}$, recalling
that the outcomes and predictions are contained in $[r_1, r_2]$. Thus with the
above schedule of $\{\epsilon_t\}$, and given that the approximate predictions
are obtainable in time polynomial in $\frac{1}{\epsilon_t}$ on trial $t$, the
additional cumulative regret is bounded by $0 < c_0 < 1$ and the cumulative
running time of the algorithm is polynomial in the number of trials.
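The effect of a decreasing accuracy schedule can be checked numerically. The sketch below rests on two assumptions that should be read as such: that the schedule is $\epsilon_t = c_0 / (6.5 \max(r_2 - r_1, 1)(t+1)\ln^2(t+1))$, and that the per-trial approximation regret is bounded by $2\epsilon_t|y_t - \hat{y}_t| + \epsilon_t^2$ with $|y_t - \hat{y}_t| \le r_2 - r_1$. The function names are illustrative.

```python
import math


def epsilon(t, c0, r1, r2):
    """Assumed schedule: c0 / (6.5 * max(r2 - r1, 1) * (t + 1) * ln(t + 1)^2)."""
    return c0 / (6.5 * max(r2 - r1, 1.0) * (t + 1) * math.log(t + 1) ** 2)


def cumulative_approximation_regret(trials, c0, r1, r2):
    """Sum the per-trial bound 2*eps_t*|y_t - yhat_t| + eps_t^2, using the
    worst case |y_t - yhat_t| <= r2 - r1 on every trial."""
    total = 0.0
    for t in range(1, trials + 1):
        eps = epsilon(t, c0, r1, r2)
        total += 2.0 * eps * (r2 - r1) + eps * eps
    return total


if __name__ == "__main__":
    print(cumulative_approximation_regret(100_000, 0.5, -1.0, 1.0))
```

The series $\sum_t 1/((t+1)\ln^2(t+1))$ converges, which is why the accumulated bound stays below $c_0$ however many trials are run, while $1/\epsilon_t$ grows only slightly faster than linearly in $t$.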



3.2 Analytic Forms of the Prediction and Loss Functions

We compute the predictions yt of the k-lms-net algorithm (cf. (5)) by maintain- 
ing explicit symbolic representations of j(x), yj(a) and (j (a); the symbolic 
representations may then be exploited to find the critical points. 

Below we give an explicit representation of (omitting y\{a) and 

^[i,i](o^) as they follow directly) by expanding the recurrence (6) giving the 2* — 1 
terms below 

t 

where Tt>,k{x) = ^ 

and (^215^22) ^ ’’’ ^ 1 j ^ik') ^OL {p^ik t 

For example, 4 (x) = y Ya=i yika{xi,x) + Tfyika{xi,X2)ka{x2, x^)ka{xz, x) - 
ff [yi ka{xi, X 2 )ka {x 2 ,x)+yika{xi, X3) {x^ ,x)+y 2 ka (x 2 ,xs)ka (xs , x)] . Clearly 
we cannot expect to give a polynomial time algorithm if we manipulate this 
representation directly, thus the applications we consider are cases where (17) 
algebraically collapses to a polynomial-sized representation. 

In the following claims we give the representations of the functions needed 
to compute the predictions of the k-lms-net algorithm from the applications 
of Theorems 4 and 5. The proofs of these claims are straightforward and are 
omitted for reasons of brevity. 

Claim 1 In the k-lms-net algorithm with a kernel $k_\alpha = (1-\alpha)k_0 + \alpha k_1$, the
first layer weight function may be expressed as

t-i 

= '^Pt,z(a)ko(x^, x) + qtj(a)ki(xi, x) 

i=l 

where $p_{t,i}(\alpha)$ and $q_{t,i}(\alpha)$ are polynomials of degree $i$ in $\alpha$. Therefore, the functions
$\hat{y}_t(\alpha)$ and $L_{[1,t]}(\alpha)$ may be expressed as polynomials in $\alpha$ of degree $t-1$ and $2t-2$
respectively.

We discretize the input data to obtain a tractable method to predict with a 
Gaussian kernel. The following claim quantifies the size of the representation of 
the functions to be sampled by the degree of discretization of the input data. 

Claim 2 In the k-lms-net algorithm with the parameterized Gaussian kernel
$k_\alpha(v,x) = \exp(-s_0\alpha\|v - x\|^2)$ with fixed scale constant $s_0 \in \mathbb{R}^+$ and with the
discretized interval X = {0, . . . , 1} , the first layer weight function may be

expressed as
t — l nm‘^(i—l) 

w/iere pt,i(a) = ^ Ct,^,j 

i=l j=0 

— 3QO: 

with each Ct^ij G K. Applying the change in variable a = e to the functions 
yj(a) and L[i j](a) gives polynomials in a of degree nmfft— 1) and 2nwf{t— 1) 
respectively. 




3.3 Finding Critical Points 

We proceed by dividing functions into piecewise monotonic intervals. This means
finding their critical points, i.e., the zeros of the derivative. The problem
of finding zeros of a polynomial has been called the Fundamental Computational
Problem of Algebra [24]. There is a vast literature regarding this problem;
some general references are [17,24,3]. We do not need to actually find the
zeros; we only need to isolate them as defined below.

Definition 3. The $k$-isolation of measure $\delta$ of the zeros of a function
$f : [a,b] \to \mathbb{R}$ is a list of $j \le k$ intervals
$\{[a_1,b_1],\ldots,[a_j,b_j]\}$ such that if $f(r) = 0$ then there exists an
$i$ such that $r \in [a_i,b_i]$, and the total measure of the intervals is
$\delta$.

Observe that in the above we do not precisely find roots but isolate them, as
some intervals could have multiple roots and others none.

Definition 4. The composition of functions $f(\sigma(\cdot))$ is called a
$\sigma$-polynomial (polynomial after a change in variable to $\sigma$) if $f$
is a polynomial and $\sigma : \mathbb{R} \to \mathbb{R}$ is continuously
differentiable with $\forall x \in \mathbb{R} : \sigma'(x) \neq 0$ ($\sigma$ is
then strictly monotone without inflection). The degree of a
$\sigma$-polynomial $f(\sigma(\cdot))$ is the degree of $f$.

Every polynomial is a $\sigma$-polynomial where $\sigma$ is the identity
function.

Claim 3 Given a $\sigma$-polynomial $f(\sigma(\cdot)) : [0,1] \to \mathbb{R}$ of
degree $s$, there exists an $s$-isolation of the zeros of measure $2^{-p}$; the
isolation is computable in time polynomial in $p$ and in $s$.

Any algorithm that can efficiently act as a root-existence oracle for an
interval $[a,b]$ of a polynomial, returning TRUE if there exist roots in
$[a,b]$ and FALSE otherwise, can be subordinated within a bisection algorithm
to compute an $s$-isolation. For reasons of brevity, we do not give an
algorithm to compute an $s$-isolation efficiently in this abstract (see
[17,24,3] for algorithms where, e.g., the Euclidean Algorithm may efficiently
serve as an oracle).
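As an illustration of the oracle-plus-bisection idea, the following Python sketch uses a Sturm sequence (built with the Euclidean algorithm on a polynomial and its derivative) as the root-existence oracle. It is a minimal sketch under stated assumptions, not the algorithm of the cited references: it treats only the case where $\sigma$ is the identity, it uses exact rational arithmetic via `fractions.Fraction`, and all function names are hypothetical.

```python
from fractions import Fraction


def trim(p):
    """Drop trailing zero coefficients (coefficients are stored lowest degree first)."""
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p


def evaluate(p, x):
    """Horner evaluation of polynomial p at x."""
    acc = Fraction(0)
    for c in reversed(p):
        acc = acc * x + c
    return acc


def derivative(p):
    return trim([i * c for i, c in enumerate(p)][1:] or [Fraction(0)])


def remainder(a, b):
    """Remainder of the polynomial division a / b (one Euclidean step)."""
    a, b = trim(a), trim(b)
    while len(a) >= len(b) and any(a):
        factor, shift = a[-1] / b[-1], len(a) - len(b)
        for i, c in enumerate(b):
            a[shift + i] -= factor * c
        a = trim(a)
    return a


def sturm_chain(p):
    """Sturm sequence of p, built with the Euclidean algorithm on p and p'."""
    chain, nxt = [trim(p)], derivative(p)
    while any(nxt):
        chain.append(nxt)
        nxt = [-c for c in remainder(chain[-2], chain[-1])]
    return chain


def sign_changes(chain, x):
    values = [v for v in (evaluate(q, x) for q in chain) if v != 0]
    return sum(1 for u, v in zip(values, values[1:]) if (u > 0) != (v > 0))


def has_root(chain, a, b):
    """Root-existence oracle: True iff chain[0] has a zero in the closed interval [a, b]."""
    if evaluate(chain[0], a) == 0 or evaluate(chain[0], b) == 0:
        return True
    return sign_changes(chain, a) > sign_changes(chain, b)


def isolate_zeros(coeffs, lo, hi, p_bits):
    """Bisect [lo, hi] until the intervals flagged by the oracle have total
    measure at most 2**-p_bits; adjacent intervals are merged at the end."""
    chain = sturm_chain([Fraction(c) for c in coeffs])
    if not any(chain[0]):
        raise ValueError("the zero polynomial has no isolated zeros")
    target = Fraction(1, 2 ** p_bits)
    intervals = [(Fraction(lo), Fraction(hi))]
    intervals = [iv for iv in intervals if has_root(chain, *iv)]
    while intervals and sum(b - a for a, b in intervals) > target:
        halves = [piece for a, b in intervals
                  for piece in ((a, (a + b) / 2), ((a + b) / 2, b))]
        intervals = [iv for iv in halves if has_root(chain, *iv)]
    merged = []
    for a, b in intervals:
        if merged and merged[-1][1] == a:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged


if __name__ == "__main__":
    # (x - 1/3)(x - 3/4) = x^2 - (13/12) x + 1/4
    print(isolate_zeros([Fraction(1, 4), Fraction(-13, 12), 1], 0, 1, 10))
```

Each bisection level keeps at most two intervals per distinct root, so the number of oracle calls grows linearly in both the degree and the number of bits `p_bits`. Merging adjacent intervals at the end keeps the final list no longer than the number of distinct roots.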

Corollary 1. Given $\sigma$-polynomials $f(\sigma(\cdot)) : [0,1] \to \mathbb{R}$
and $g(\sigma(\cdot)) : [0,1] \to \mathbb{R}$ of degree $s_1$ and $s_2$
respectively, and letting $l = e^{-f(\sigma)}$ and $m = g(\sigma)e^{-f(\sigma)}$,
then there exists an $(s_1 - 1)$-isolation and an $(s_2 + (s_1 - 1))$-isolation
of the zeros of $l'$ and $m'$ respectively, of measure $2^{-p}$, computable in
time polynomial in $s_1$, $s_2$ and $p$.

Proof. Omitted for the sake of brevity.

3.4 Deterministic Piecewise Monotone Sampling

Our sampler functions as follows. The quantity we wish to estimate is a quotient
of integrals. Relative error approximations (i.e., “significant digits”) are
closed under division. Hence we develop a sampler for a single integral for
which we can give a relative $\epsilon$-approximation scheme. Our intuition
from absolute error approximations may suggest that a quantity to bound is the
maximal slope of the function to be integrated; this quantity is of less use
than the relative variation of a positive function, i.e., the ratio of its
largest value to its smallest. For a positive monotone function we will require
a quantity of samples logarithmic in the relative variation. With a bound for
monotone functions we can generalize to piecewise monotone functions; this
requires that we isolate the critical points of our function. In the previous
section, the applications chosen lead to functions for which it is easy to find
the critical points. This leads to a method that samples exponentially more
often in areas of large volume. We note that the sampler here has been designed
to directly “prove a bound”; using the techniques here it is possible to design
a sampler that also proves the bound, but which is considerably more adaptive
(uses fewer samples), and hence more useful in practice.

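In the following three lemmas we give simple algebraic results about
$\epsilon$-approximations.

Lemma 1. Suppose $\hat{a}$ is a relative $\epsilon$-approximation of $a$ and
$\hat{b}$ is a relative $\epsilon$-approximation of $b$; then $\hat{a} + \hat{b}$
is a relative $\epsilon$-approximation of $a + b$.

Lemma 2. Suppose $\hat{b}$ is a relative $\epsilon$-approximation of $b$ and
$b \le B$; then $\hat{b}$ is an absolute $B\epsilon$-approximation of $b$.

Lemma 3. Suppose $\hat{a}$ is a relative $\epsilon$-approximation of $a$ and
$\hat{b}$ is a relative $\epsilon$-approximation of $b$; then
$\frac{\hat{a}}{\hat{b}}$ is a relative $3\epsilon$-approximation of
$\frac{a}{b}$ for all $\epsilon \in (0, \frac{1}{3})$.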

The following scale-invariant theorem is the key to our sampling methodology. 

Theorem 6. Given a continuous nondecreasing function
$f : [a,b] \to \mathbb{R}^+$, let $y = \int_a^b f(\alpha)\,d\alpha$. Define
$z = \frac{f(b)}{f(a)}$; then there exists a relative $\epsilon$-approximation
for $y$ which requires
$\left\lceil \frac{1}{2}\left(\frac{1}{\epsilon} + 1\right)\ln z \right\rceil + 2$
samples (evaluations) of $f$.

The proof in Appendix A details how the samples are chosen.
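For contrast, it may help to see what a naive sampler costs. The sketch below is not the sampling scheme of Theorem 6; it is a baseline under the assumption of a uniform grid, with the hypothetical name `sandwich_estimate`. It brackets the integral of a positive nondecreasing function between its lower and upper Riemann sums and returns their midpoint. To guarantee relative error $\epsilon$ it needs about $(z-1)/(2\epsilon)$ intervals, which is linear in the relative variation $z$, whereas the theorem asks for a number of samples logarithmic in $z$.

```python
import math


def sandwich_estimate(f, a, b, eps):
    """Estimate the integral of a positive nondecreasing f over [a, b] from
    lower and upper Riemann sums on a uniform grid.

    Returns (estimate, lower, upper) with lower <= integral <= upper and a
    relative error of at most eps for the estimate.
    """
    fa, fb = f(a), f(b)
    if not 0.0 < fa <= fb:
        raise ValueError("f must be positive and nondecreasing on [a, b]")
    z = fb / fa                                     # relative variation
    n = max(1, math.ceil((z - 1.0) / (2.0 * eps)))  # number of intervals
    width = (b - a) / n
    samples = [f(a + i * width) for i in range(n + 1)]
    lower = width * sum(samples[:-1])               # left endpoints: underestimate
    upper = width * sum(samples[1:])                # right endpoints: overestimate
    return 0.5 * (lower + upper), lower, upper


if __name__ == "__main__":
    est, lo_sum, hi_sum = sandwich_estimate(lambda x: math.exp(3.0 * x), 0.0, 1.0, 0.01)
    print(est, lo_sum, hi_sum)
```

The guarantee follows from $U - L = (f(b) - f(a))(b-a)/n$ and $y \ge f(a)(b-a)$, so the midpoint is within $(z-1)/(2n)$ of $y$ in relative terms. The same sandwich argument with a better-chosen, non-uniform grid is what brings the sample count down.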

The following lemma demonstrates that a good relative $\epsilon$-approximation
may be obtained by well-approximating a function on a subset of its domain if
the measure of the non-approximated subset times the function’s relative
variation is sufficiently small.

Lemma 4. Given measurable sets $E' \subset E$ with $\mu(E) = 1$ and a continuous
function $f$ such that $\forall x \in E : 0 < a \le f(x) \le b$, define
$\Delta = \mu(E - E')$ and $z = \frac{b}{a}$. Then if $\hat{y}$ is a relative
$\epsilon'$-approximation of $\int_{E'} f\,d\mu$, it is also the case that
$\hat{y}$ is a relative $\epsilon$-approximation of $\int_E f\,d\mu$ when
$\epsilon' + \Delta z \le \epsilon$.

Proof. Omitted for the sake of brevity.

We now summarize the process for approximating $\int_E f\,d\mu$: i) we divide
$f$ into monotonic regions by isolating the critical points in sufficiently
small intervals (cf. Claim 3 and Corollary 1); ii) the integral of each
monotonic region is then separately approximated (cf. Theorem 6); and iii) the
separate approximations are then summed without the isolated intervals (cf.
Lemma 4) to obtain an approximation of $\int_E f\,d\mu$. In Appendix A, Lemmas 8
and 9 are given. These detail the separate approximation of the denominator and
numerator of (5). Their proofs follow the basic sketch above except that
additional points of the functions need to be isolated in order to properly
clip the functions. The following theorem combines Lemmas 8 and 9 to
demonstrate the computation of an absolute $\epsilon$-approximation to a
prediction of k-lms-net.

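Theorem 7. Given a $\sigma$-polynomial $f(\sigma(\cdot)) : [0,1] \to [0,\infty)$
of degree $s$ and $z \in (1,\infty)$, and a $\sigma$-polynomial
$g(\sigma(\cdot)) : [0,1] \to (-\infty,\infty)$ of degree $t$, we may compute an
absolute $\epsilon$-approximation of

$$\frac{\int_0^1 \max(r_1, \min(g(\sigma(\alpha)), r_2))\,
\max(e^{-f(\sigma(\alpha))}, z^{-1})\,d\alpha}
{\int_0^1 \max(e^{-f(\sigma(\alpha))}, z^{-1})\,d\alpha} \tag{18}$$

in time polynomial in $s$, $t$, $\ln(z(1 + r_2 - r_1))$, and $\epsilon^{-1}$.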



Acknowledgments. I thank Massimiliano Pontil for valuable discussions and 

for pointing out the inspirational kernel sum theorem from Aronszajn [2]. 

References 

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to
MCMC for machine learning. Machine Learning, 50(1-2):5-43, 2003.

[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337- 
404, 1950. 

[3] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and real computation.
Springer-Verlag, 1998.

[4] N. Cesa-Bianchi, P. Long, and M. Warmuth. Worst-case quadratic loss bounds 
for on-line prediction of linear functions by gradient descent. IEEE Transactions 
on Neural Networks, 7(2):604-619, May 1996. 

[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. 
Cambridge University Press, Cambridge, UK, 2000. 

[6] L. Csato and M. Opper. Sparse on-line gaussian processes. Neural Computation, 
14(3):641-668, 2002. 

[7] Y. Freund. Predicting a binary sequence almost as well as the optimal biased 
coin. In Proc. of COLT., pages 89-98. ACM Press, New York, NY, 1996. 

[8] M. Gibbs and D. MacKay. Efficient implementation of gaussian processes (draft 
manuscript), 1996. 

[9] M. Herbster. Learning additive models online with fast evaluating kernels. In 
COLT 2001, Proceedings, volume 2111 of LNAI, pages 444-460. Springer, 2001. 

[10] W. Hoeffding. Probability inequalities for sums of bounded random variables. 
Journal of the American Statistical Association, 58(301):13-30, Mar. 1963. 

[11] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. In 
NIPS 14 , Cambridge, MA, 2002. MIT Press. 

[12] J. Kivinen and M. K. Warmuth. Averaging expert predictions. Lecture Notes in 
Computer Science (EUROCOLT), 1572:153-167, 1999. 

[13] N. Littlestone. Learning when irrelevant attributes abound: A new linear- 
threshold algorithm. Machine Learning, 2:285-318, 1988. 







[14] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Informa- 
tion and Computation, 108(2):212-261, 1994. 

[15] C. Micchelli and M. Pontil. Learning the kernel function via regularization. Dept.
of Computer Science, University College London, Research Note: RN/04/12, 2004.

[16] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In Neural Informa- 
tion Processing Systems, volume 15. MIT Press, 2002. 

[17] V. Y. Pan. Solving a polynomial equation: Some history and recent progress. 
SIAM Review, 39(2):187-220, 1997. 

[18] J. F. Traub and A. G. Werschulz. Complexity and Information. Cambridge 
University Press, Cambridge, 1998. 

[19] J. Vermaak, S. J. Godsill, and A. Doucet. Sequential bayesian kernel regression. 
In NIPS 16. MIT Press, Cambridge, MA, 2004. 

[20] V. Vovk. Aggregating strategies. In Proc. 3rd Annu. Workshop on Comput. 
Learning Theory, pages 371-383. Morgan Kaufmann, 1990. 

[21] V. Vovk. Competitive on-line statistics. Bull. of the International Stat. Inst., 1999.

[22] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In 
NIPS 1995, Cambridge, Massachusetts, 1996. MIT Press. 

[23] Q. Wu, Y. Ying, and D.-X. Zhou. Multi-kernel regularized classifiers. Submitted 
to Journ. of Complexity, 2004. 

[24] C. K. Yap. Fundamental problems of algorithmic algebra. Oxford Uni. Press, 2000. 



A Additional Proofs 



Lemma 5. Given domain- compatible kernels kg and k\ such that TLci(\hLi = 
{0}, let ka = {\ — a)ko -\- ak\. Then given f G TLo -\- Hi, we have Vc G (0, 1] : 
V(5 G [0,min(j, that there exists 0 < a' < a” < 1 with a” — a' = S such 

that Va G [a' , a"], it is the case that ||/||^ < C^(f) + c. 

Proof. Let /o + /i with fi &Hi. Without loss of generality assume that ||/i||i < 
ll/ollo. Set a' = recalling that |1/||^, = inf„g[o.i] WfWl = C^f). Let 

a; = \\f\\l>+s - ll/llad by substituting a' = into (9) we have that 



a: = ^(ll/illi + ll/ollo)^ 



ll/ollnll/ilb 

5 



o-ll/illi 



n -1 



(ll/ollo + ll/illi)^ 

As we are upper bounding x let us separately upper bound 






p(ll/o|lo,ll/illi,<5) = 



^^^ + ll/oll^-||/l||? 
(ll/ollo +II/1II1)" 



->5 



(19) 



(20) 



As a function of || /i || through routine calculations it can be shown that p obtains 
its maximum (for S G [0,1/4]) on either the boundary ||/i||;^ = 0 or ||/i|| = 

ll/ollo- Thus substituting, p(||/o||o, 0, 5) = and p(||/o|lo> ll/ollo. <^) = 
Therefore, for all S G [0,1/4], we have p(||/o|lo. ll/ollo. <^) < P(ll/ollo. 0. ^) < |- 
By combining the upper bound of | with (19) we have that for 6 G [0,1/4], 
|[/|la/+y < C^(/) -I- |(5C^(/); therefore with a" = a' -b <5 we are done. □ 
Lemma 6. Let $k_\alpha(v_1,v_2) = \exp(-s_0\alpha\|v_1 - v_2\|^2)$ denote a parameterized
(a G [0,1]) Gaussian kernel with fixed scale constant sq > 1 over the domain 
[xi,a; 2 ]" X [xi,X 2 ]^ with associated prehilhert spaces Ha- Given a function ha’ G 
Ha’ such that jj^a'IL' ^ 1 'with representation ha’{-) = (''^G •) then set 

ha’+si) = ET=i /dika’+sfvi, ■)■ Then the square loss and squared norm ofha’+s 
may he hounded hy those of ha' plus any constant 0 < c < 1 for all sequences 
((xi, 2 /i), (x 2 ,?/ 2 ), ■ • ■ , G ([xi,X 2 ]”, [ri,r 2 ])^ w/iere max(|ri |, |r 2 |) > 1. 

Hence ~ V+i(xt))^ < ~ ^a'(xt))^ + c and \\ha’+s\\^a’+s ^ 

ll^a'L' + C for all S G [0, 5 sQ£max(|ri|.|r 2 |)n(a: 2 -a:i)=' 52 (h^,)]’ 

Proof. Omitted in this abstract; see full version. 

Proof (sketches of Theorems 4 and 5). The theorems follow directly from
Theorem 1 with Lemmas 5 and 6, with $A$ chosen so that in each case $\mu(A) = \delta$.



The following inequality is needed for Theorem 6.

Lemma 7. Suppose $r > 1$, then $r + \frac{1}{2} > \left[\ln\left(1 + \frac{1}{r}\right)\right]^{-1}$.
Proof (of Theorem 6). Let r = ^ and choose n s.t. 



{r+ -)lnz 



-I- 1 < n 



( 21 ) 



the samples are denoted 



where n -I- 1 is the total number of function samples. The function is sampled 
over n intervals with widths 

(i+r)(i-(^) ) 

/(a) = /o,/i,... ,/„ = f{h) where fi = f{a + Yl]=i^i)- We also need the 
following inequality relating r, n and z: 



r -I- 1 



1 

< -. 
z 



(22) 



From (21) and Lemma 7 it follows that, Inz < (n — 1) In 'GA which implies (22). 

Define M = fiAp, now define lower and upper bounds of y, L and 

U hy L = foAi + yjyiM < y < M + fnA„ = U. We proceed to show that 
y = ^{L + U) is a relative e-approximation of y, since U and L are upper and 
lower bounds if we can show [7(1 — e) < y < L(1 -|- e) which is equivalent to the 
conjunction of conditions 1 < e, and 1 < e. This will prove that y is 

an e-approximation of y. However, since L < U we need only show 1 < £• 

Thus, 

lU-L + 

2 L ~ 2 foA, + ^ ^ 



1 

< - 
- 2 



1 1 ( r \ f 

r+1 2^i—l yr+1 J 

/o+yTTi:”J'i'(FTT)V. 




(24) 



where (24) follows from (22). Hence $\hat{y}$ is an $\epsilon$-approximation of
$y$ with the requisite number of samples. □

The following two lemmas give a method to obtain relative $\epsilon$-approximations
of the denominator and the numerator of the predictions (5) of k-lms-net.

Lemma 8. Given a $\sigma$-polynomial $f(\sigma(\cdot)) : [0,1] \to [0,\infty)$ of
degree $s$ and a $z \in (1,\infty)$, we may compute a relative
$\epsilon$-approximation of
$\int_0^1 \max(e^{-f(\sigma(\alpha))}, z^{-1})\,d\alpha$ in time polynomial in
$s$, $\ln(z)$ and $\epsilon^{-1}$.

Proof. Let $l = e^{-f(\sigma(\alpha))}$. By Corollary 1, we may puncture the
interval $[0,1]$ into no more than $r \le s$ regions with the (up to) $s - 1$
critical points of $l$ isolated into a total measure of no more than $2^{-p}$.
By construction, in each of the $r$ regions $\max(l, z^{-1})$ is monotonic; thus
we may apply Theorem 6 to each of the $r$ regions (with
$\epsilon = \epsilon'/2$) since relative $\epsilon$-approximations add (cf.
Lemma 1). The estimator formed by adding the $r$ estimators is a relative
$\epsilon'/2$-approximation to $\int_{E'} \max(l, z^{-1})\,d\alpha$, where $E'$
is the interval $[0,1]$ minus the isolates. However, if we set
$p = 1 + \log(z/\epsilon')$, by Lemma 4 we have a relative
$\epsilon'$-approximation to $\int_0^1 \max(l, z^{-1})\,d\alpha$. □

Lemma 9. Given a $\sigma$-polynomial $f(\sigma(\cdot)) : [0,1] \to [0,\infty)$ of
degree $s$ and $z \in (1,\infty)$, and a $\sigma$-polynomial
$g(\sigma(\cdot)) : [0,1] \to (-\infty,\infty)$ of degree $t$, we may compute a
relative $\epsilon$-approximation of
$\int_0^1 \max(1, \min(g(\sigma(\alpha)), 1 + r))\,
\max(e^{-f(\sigma(\alpha))}, z^{-1})\,d\alpha$
in time polynomial in $s$, $t$, $\ln(z(1 + r))$, and $\epsilon^{-1}$.

Proof. We sketch the proof for reasons of brevity and because it closely
follows the proof of Lemma 8. The key difference is the need to create
additional isolates, since it is subtle to clip $g(\sigma(\alpha))$ and
$e^{-f(\sigma(\alpha))}$ independently. In fact we need to isolate the critical
points of $g(\sigma(\alpha))e^{-f(\sigma(\alpha))}$, $g(\sigma(\alpha))$, and
$e^{-f(\sigma(\alpha))}$, and the zeroes of $g(\sigma(\alpha)) = 1$,
$g(\sigma(\alpha)) = 1 + r$, and $e^{-f(\sigma(\alpha))} = z^{-1}$. Now we can
ensure that we can correctly clip and also that, for each region between
isolates, the function
$\max(1, \min(g(\sigma(\alpha)), 1 + r))\,\max(e^{-f(\sigma(\alpha))}, z^{-1})$
is monotonic. We observe that the number of isolates is polynomial in $s$ and
$t$. □

We compute a shifted version of the quotient (5) so that the predictions and 
outcomes may have both positive and negative values. 

Proof (of Theorem 7). Apply Lemmas 9 and 8 to give relative e'-^pproximations 
max(l, min(p(cr(a)) -|- 1 — ri, 1 -I- r 2 — ri)) max(e“-^*^'^^“^\ and 
fg max(e~-^^'^^°'^\ z~^)da with e' = Observe that 

, /p^ max(l,min(p(cr(a)) -I- 1 — ri, 1 -I- T 2 — ri)) max(e“^('^(“^\ 

y pi ^ f, , ss , , , (25) 

fg max(e“^0(“b, 2;-i)dQ; 

is an expectation of a quantity bounded by 1 and l+r 2 — ri, hence y' < l-|-r 2 — ri. 
Therefore by Lemmas 2 and 3 j^^y' ■ Since y' = y -\- 1 — ri, we conclude that 

f-(l-n)4p. 



□

Abstract. The decomposition method is currently one of the major
methods for solving the convex quadratic optimization problems
associated with Support Vector Machines (SVM-optimization). A key issue
in this approach is the policy for working set selection. We would like
to find policies that realize (as well as possible) three goals
simultaneously: “(fast) convergence to an optimal solution”, “efficient
procedures for working set selection”, and “high degree of generality”
(including typical variants of SVM-optimization as special cases). In this
paper, we study a general policy for working set selection that has been
proposed quite recently. It is known that it leads to convergence for any
convex quadratic optimization problem. Here, we investigate its computational
complexity when it is used for SVM-optimization. We show that it poses
an NP-complete working set selection problem, but a slight variation
of it (sharing the convergence properties with the original policy) can
be solved in polynomial time. We show furthermore that so-called “rate
certifying pairs” (introduced by Hush and Scovel) can be found in linear
time, which leads to a quite efficient decomposition method with a
polynomial convergence rate for SVM-optimization.



1 Introduction 

Support vector machines (SVMs) introduced by Vapnik and co-workers [1,23] 
are a promising technique for classification, function approximation, and other 
key problems in statistical learning theory. In this paper, we consider the opti- 
mization problems that are induced by SVMs, which are special cases of convex 
quadratic optimization. 

The difficulty of solving problems of this kind is the density of the matrix
that represents the “quadratic part” of the cost function. Thus, a prohibitive
amount of memory is required to store the matrix, and traditional optimization
algorithms (such as Newton, for example) cannot be directly applied. Several
authors have proposed (different variants of) a decomposition method to
overcome this difficulty [20,6,21,22,17,18,2,9,13,19,10,8,14,15,11,3,12,4]. This
method keeps track of a current feasible solution which is iteratively
improved. In each iteration the variable indices are split into a “working set”
$I \subseteq \{1,\ldots,m\}$ and its complement $J = \{1,\ldots,m\} \setminus I$.
Then, the subproblem with variables $x_i$, $i \in I$, is solved, thereby leaving
the values for the remaining variables $x_j$, $j \in J$, unchanged. The success
of the method depends in a quite sensitive manner on the policy for the
selection of the working set $I$ (whose size is typically bounded by a small
constant). Ideally, the selection procedure should be computationally efficient
and, at the same time, effective in the sense that the resulting sequence of
feasible solutions converges (with high speed) to an optimal limit point.
Clearly, these goals are conflicting in general and trade-offs are to be
expected.

* This work was supported in part by the IST Programme of the European
Community, under the PASCAL Network of Excellence, IST-2002-506778. This
publication only reflects the authors’ views. This work was furthermore
supported by the Deutsche Forschungsgemeinschaft Grant SI 498/7-1.

Our results and their relation to previous work. We study a general policy for 
working set selection that has been proposed quite recently [16]. It is known that 
it leads to convergence for any convex quadratic optimization problem. Here, we 
investigate its computational complexity when it is used for SVM-optimization. 
As a “test-case”, we discuss one of the most popular and well-studied SVM- 
optimization problems.^ We show that the application of the general policy poses 
an NP-hard working set selection problem (see Section 4). Although this looks 
frustrating at first glance, it turns out that a slight variation of the general policy 
(replacing in some sense “exact working set selection” by “approximate work- 
ing set selection”) allows still for a general convergence proof (valid for convex 
quadratic optimization in general; see Section 3) and poses an efficiently solv- 
able working set selection problem as far as SVM-optimization is concerned (see 
Section 5). 

It would clearly be desirable to achieve a high convergence speed. Hush and
Scovel [4] proved recently that the decomposition method (when applied to the
same SVM-optimization problem that we discuss in our paper) leads to a
polynomial convergence rate if the working set always contains a so-called
“rate certifying pair”. They also provided an $O(m \log m)$ algorithm for
finding such a pair. We design a new algorithm that finds a rate certifying
pair in linear time. This leads (to the best of our knowledge) to the first
working set selection procedure that is (almost) as efficient as Joachims’
method [6] used in the software package $\mathrm{SVM}^{light}$ and has (on top
of being efficient) a provable polynomial convergence rate.

More relations to previous work will be discussed in the full paper. 

2 Preliminaries 

Throughout this paper,

$$f(x) = \frac{1}{2}x^\top Q x - w^\top x
= \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} Q_{i,j}x_i x_j - \sum_{i=1}^{m} w_i x_i \tag{1}$$

denotes a convex cost function, where $Q \in \mathbb{R}^{m\times m}$ is a
positive semi-definite matrix over the reals. We are interested in optimization
problems of the following

^ The dual problem to maximum margin classification with the 1-norm soft margin 
criterion 




326 



H.U. Simon 



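general form $\mathcal{P}$:

$$\min_x f(x) \quad \text{s.t.} \quad Ax = b,\ l \le x \le r \tag{2}$$

Here, $A \in \mathbb{R}^{k\times m}$, $b \in \mathbb{R}^k$,
$l, r \in \mathbb{R}^m$, and $l \le x \le r$ is the short-notation for the
“box constraints”

$$\forall i = 1,\ldots,m : l_i \le x_i \le r_i .$$

In the sequel,

$$R(\mathcal{P}) = \{x \in \mathbb{R}^m \mid Ax = b,\ l \le x \le r\}$$

denotes the compact set of feasible points for $\mathcal{P}$.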

We briefly note that any bounded^ optimization problem with cost function 
f{x) and linear equality- and inequality-constraints can be brought into the 
form (2) because we may convert the linear inequalities into linear equations by 
introducing non-negative slack variables. By the compactness of the region of 
feasible points, we may also put a suitable upper bound on each slack variable 
such that finally all linear inequalities take the form of box constraints. 

Example 1. A popular (and actually one of the most well studied) variant of
Support Vector Machines leads to an optimization problem of the following form
$\mathcal{P}_0$:

$$\min_x f(x) \quad \text{s.t.} \quad y^\top x = 0,\ l \le x \le r \tag{3}$$

Here, $y \in \{-1,1\}^m$ is a vector whose components represent binary
classification labels. Note that we may “normalize” $\mathcal{P}_0$ by
substituting $y_i x_i$ for $x_i$. This leads to a problem of the same form as
$\mathcal{P}_0$ but with $y = \mathbf{1}$, where $\mathbf{1}$ denotes the
“all-ones” vector. (An analogous notational convention is used for any other
constant.) The main difference between $\mathcal{P}_0$ and the general problem
$\mathcal{P}$ is that $\mathcal{P}_0$ has only the single equality constraint
$y^\top x = 0$.³

The decomposition method with working sets of size at most $q$ proceeds
iteratively as follows: given a feasible solution $x \in R(\mathcal{P})$ (chosen
arbitrarily in the beginning), a so-called working set
$I \subseteq \{1,\ldots,m\}$ of size at most $q$ is selected. Then $x$ is
updated by the optimal solution for the subproblem with variables $x_i$,
$i \in I$ (leaving the values $x_j$ with $j \notin I$ unchanged). The policy for
working set selection is a critical issue that we briefly discuss in the next
section.
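The iteration just described can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, none of which come from the paper: the problem has a single equality constraint with all-ones coefficients (the normalized form of Example 1), working sets have size $q = 2$, $Q$ is symmetric, and the working set is chosen by a greedy exhaustive search over pairs. That greedy rule is only a placeholder and is not the selection policy studied in this paper; all function names are hypothetical.

```python
def gradient(Q, w, x):
    """Gradient of f(x) = 0.5 x^T Q x - w^T x for symmetric Q."""
    m = len(x)
    return [sum(Q[i][j] * x[j] for j in range(m)) - w[i] for i in range(m)]


def cost(Q, w, x):
    m = len(x)
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(m) for j in range(m))
    return 0.5 * quad - sum(w[i] * x[i] for i in range(m))


def pair_step(Q, g, x, lo, hi, i, j):
    """Best step t for the move x_i += t, x_j -= t, and the decrease in f it yields."""
    curvature = Q[i][i] + Q[j][j] - 2.0 * Q[i][j]
    slope = g[i] - g[j]
    t_min = max(lo[i] - x[i], x[j] - hi[j])   # box constraints on both variables
    t_max = min(hi[i] - x[i], x[j] - lo[j])
    if curvature > 0.0:
        t = min(max(-slope / curvature, t_min), t_max)
    else:
        t = t_min if slope > 0.0 else t_max
    return t, -(slope * t + 0.5 * curvature * t * t)


def decompose(Q, w, x, lo, hi, max_iter=1000, tol=1e-12):
    """Repeatedly pick a pair (working set of size 2) and solve its subproblem."""
    x = list(x)
    m = len(x)
    for _ in range(max_iter):
        g = gradient(Q, w, x)
        candidates = [(pair_step(Q, g, x, lo, hi, i, j), i, j)
                      for i in range(m) for j in range(i + 1, m)]
        (t, decrease), i, j = max(candidates, key=lambda c: c[0][1])
        if decrease <= tol:
            break
        x[i] += t      # moving along e_i - e_j keeps the sum of x unchanged
        x[j] -= t
    return x


if __name__ == "__main__":
    Q = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
    print(decompose(Q, [1.0, 0.0, -1.0], [0.0, 0.0, 0.0], [-1.0] * 3, [1.0] * 3))
```

Every update stays feasible by construction, so the cost decreases monotonically. How quickly it decreases depends entirely on how the pair is chosen, which is why the selection policy is the critical design decision.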

3 Working Set Selection and Convergence 

Consider the situation where we attack a problem $\mathcal{P}$ of the general
form (2) by means of the decomposition method. As explained below, the
selection of a working set $I \subseteq \{1,\ldots,m\}$ of size at most $q$ can
be guided by a function family $C_I(x)$. For instance, the following family was
considered in [16]:

$$C_I(x) := \inf_{\lambda\in\mathbb{R}^k} \sum_{i\in I} \Big[
(x_i - l_i)\max\{0, \nabla f(x)_i - A_i^\top\lambda\}
+ (r_i - x_i)\max\{0, A_i^\top\lambda - \nabla f(x)_i\} \Big] \tag{4}$$

Here, $A_i$ denotes the $i$’th column of $A$ and $A_i^\top$ its transpose.

² Here, “bounded” means that the region of feasible points is compact (or can
be made compact without changing the smallest possible cost).

³ As far as the SVM-application is concerned, we could also set $w = 1$,
$l = 0$, and $r = C$ for some constant $C > 0$. Since these settings do not
substantially simplify the problem, we prefer the more general notation.

The following results, being valid for any optimization problem $\mathcal{P}$
of the form (2), were shown in [16]:

1. The function family $C_I(x)$ defined in (4) satisfies the following
conditions:
(C1) For each $I \subseteq \{1,\ldots,m\}$ such that $|I| \le q$, $C_I(x)$ is
continuous on $R(\mathcal{P})$.
(C2) If $|I| \le q$ and $x'$ is an optimal solution for the subproblem induced
by the current feasible solution $x$ and working set $I$, then $C_I(x') = 0$.
(C3) If $x$ is not an optimal solution for $\mathcal{P}$, then there exists an
$I \subseteq \{1,\ldots,m\}$ such that $|I| \le q$ and $C_I(x) > 0$, provided
that $q \ge k+1$.

2. Let $q \ge k+1$ (where $k$ is the number of equality constraints in
$\mathcal{P}$). Assume that $Q \in \mathbb{R}^{m\times m}$ (the matrix from the
definition of the cost function in (1)) has positive definite submatrices
$Q_{I,I}$ for all subsets $I \subseteq \{1,\ldots,m\}$ of size at most $q$.⁴ Let
$C_I(x)$ be a family of functions⁵ that satisfies conditions (C1), (C2), (C3).
Let a decomposition method induced by $C_I(x)$ be an algorithm that, given the
current feasible solution $x$, always chooses a working set
$I \subseteq \{1,\ldots,m\}$ of size at most $q$ that maximizes $C_I(x)$. Then
this method converges to an optimal solution.

We will show in Section 4 that the working set selection induced by the
family $C_I(x)$ from (4) is (unfortunately) an NP-complete problem already for the
(comparatively simple) SVM-optimization problem $\mathcal{P}_0$ from (3). We will, however,
design a polynomial-time algorithm that, given the current feasible
solution $x$, always finds a working set $I \subseteq \{1, \ldots, m\}$ of size at most $q$ that
maximizes $C_I(x)$ up to a factor of 4. This raises the question whether convergence
can still be guaranteed when the selection of a working set that maximizes $C_I(x)$
(referred to as “Exact Working Set Selection”) is replaced by the selection of
a working set that maximizes $C_I(x)$ only up to a constant factor (referred to
as “Approximate Working Set Selection”). It turns out, fortunately, that this is
true:

Theorem 1. Let $\mathcal{P}$ be the optimization problem given by (2) and (1), and let
$q \ge k + 1$. Assume that $Q \in \mathbb{R}^{m \times m}$ is a matrix with positive definite submatrices
$Q_{I,I}$ for all subsets $I \subseteq \{1, \ldots, m\}$ of size at most $q$. Let $C_I(x)$ be a family
of functions that satisfies conditions (C1), (C2), (C3). Let an approximate
decomposition method induced by $C_I(x)$ be an algorithm that, given the current
feasible solution $x$, always chooses a working set $I \subseteq \{1, \ldots, m\}$ of size at most
$q$ that maximizes $C_I(x)$ up to a constant factor. Then this method converges to
an optimal solution.

Footnote: The assumption on the submatrices of $Q$ does not follow automatically from the positive semi-definiteness
of $Q$ (though it would follow from positive definiteness of $Q$). It is however (at least
for SVM-applications) often satisfied. For some kernels like, for example, the RBF-
kernel, it is certainly true; for other kernels it is typically satisfied provided that $q$ is
sufficiently small. See also the discussion of this point in [13].

Footnote: The family of functions is not necessarily identical to the family defined in (4).

Theorem 1 (whose proof is similar to the original proof of convergence in [16] and 
therefore omitted in this abstract) provides a justification to pass from “Exact 
Working Set Selection” to “Approximate Working Set Selection” . 

4 Complexity of Exact Working Set Selection 

The function family from (4) can be expressed in a simpler fashion when we
consider the special optimization problem $\mathcal{P}_0$ from (3) in its “normalized version”
(with $y = \mathbf{1}$):

$$C_I(x) = \inf_{t \in \mathbb{R}} \sum_{i \in I} \Big( (x_i - l_i) \max\{0, \nabla f(x)_i - t\} + (r_i - x_i) \max\{0, t - \nabla f(x)_i\} \Big)$$

This leads us to the following (purely combinatorial) problem:

EXACT WORKING SET SELECTION (EWSS). Given numbers $a_1 < \cdots < a_m$,
non-negative weight parameters $w_1, w'_1, \ldots, w_m, w'_m$, and a non-
negative number $q$, find a “working set” $I \subseteq \{1, \ldots, m\}$ of size at most $q$
that maximizes the “gain”

$$\inf_{t \in \mathbb{R}} \Big( \sum_{i \in I : a_i < t} w'_i (t - a_i) + \sum_{i \in I : a_i > t} w_i (a_i - t) \Big) .$$

When we consider EWSS as a decision problem, we assume that the input con- 
tains an additional bound B. The question is whether there exists a working set 
that achieves a gain of at least B. 

We may think of $a_1, \ldots, a_m$ as points on the real line that are ordered from
left to right. EWSS corresponds to a game between two players:

1. Player 1 selects $q' \le q$ points. Intuitively, this means that she commits herself
to a working set $I \subseteq \{1, \ldots, m\}$ of size $q'$.

2. Player 2 selects a threshold $t$ and pays an amount of
$\sum_{i \in I : a_i < t} w'_i (t - a_i) + \sum_{i \in I : a_i > t} w_i (a_i - t)$ to player 1.

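Given $I$ (the move of player 1), the optimal move of player 2 is easy to
determine. We say that threshold $t$ is in a balanced position if the following
holds:

1. If $t \notin \{a_1, \ldots, a_m\}$, then $\sum_{i \in I : a_i < t} w'_i = \sum_{i \in I : a_i > t} w_i$.

2. If $t = a_j$ for some $j \in \{1, \ldots, m\}$, then
$\sum_{i \in I : a_i \le a_j} w'_i \ge \sum_{i \in I : a_i > a_j} w_i$ and $\sum_{i \in I : a_i < a_j} w'_i \le \sum_{i \in I : a_i \ge a_j} w_i$.

Footnote: The correspondence between the normalized function family and EWSS is as follows. Assume that
$\nabla f(x)_1 < \cdots < \nabla f(x)_m$ (after reindexing if necessary). Now, $a_i$ represents $\nabla f(x)_i$,
$w_i$ represents $x_i - l_i$, and $w'_i$ represents $r_i - x_i$.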



Lemma 1. Given $I$, the balanced positions of threshold $t$ form either an interval
$[a_j, a_{j+1}]$ or a singleton set $\{a_j\}$. Furthermore, threshold $t$ represents an optimal
move of player 2 iff it is in a balanced position.

Proof. The first statement is easy to see. We omit the straightforward proof in
this abstract. The second statement is based on the following observations (valid
for some sufficiently small $\delta > 0$):

1. If we replace $t$ by $t - \delta$, then the gain of player 1 changes by an amount of
$\Big( \sum_{i \in I : a_i \ge t} w_i - \sum_{i \in I : a_i < t} w'_i \Big) \delta$.

2. If we replace $t$ by $t + \delta$, then the gain of player 1 changes by an amount of
$\Big( \sum_{i \in I : a_i \le t} w'_i - \sum_{i \in I : a_i > t} w_i \Big) \delta$.

It follows that a threshold $t$ in an unbalanced position cannot be an optimal move
of player 2 because the gain of player 1 could be reduced by moving $t$ towards
the side with the larger total weight. Thus, a unique balanced position of the
form $t = a_j$ is an optimal move. If all positions in an interval $[a_j, a_{j+1}]$ are
balanced, then any move $t \in [a_j, a_{j+1}]$ leads to the same gain of player 1. Thus,
any such threshold represents an optimal move of player 2. □



Corollary 1. Given $I$, a threshold $t \in \{a_1, \ldots, a_m\}$ in balanced position is found
in linear time.

Proof. Let $t$ run through the sorted list $a_1 < \cdots < a_m$ and keep track of the
total $w'$-weight of $\{a_i \mid a_i < t\}$ and the total $w$-weight of $\{a_i \mid a_i > t\}$. Stop when
the balance condition is satisfied. □
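The scan from the proof of Corollary 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `gain` and `balanced_threshold` are hypothetical, the working set `I` is assumed to be non-empty and listed in increasing order of the points, and the points are assumed to be pairwise distinct. The stopping test used here is that the $w'$-weight of the points at or left of the candidate is at least the $w$-weight of the points strictly to its right.

```python
from typing import Sequence


def gain(a: Sequence[float], w: Sequence[float], w_prime: Sequence[float],
         I: Sequence[int], t: float) -> float:
    """Amount paid by player 2 when player 1 holds I and the threshold is t."""
    left = sum(w_prime[i] * (t - a[i]) for i in I if a[i] < t)
    right = sum(w[i] * (a[i] - t) for i in I if a[i] > t)
    return left + right


def balanced_threshold(a: Sequence[float], w: Sequence[float],
                       w_prime: Sequence[float], I: Sequence[int]) -> float:
    """Return a point a[j], j in I, that minimizes the gain (one linear scan).

    Assumes I is non-empty, sorted by increasing a[i], with distinct a[i].
    """
    if not I:
        raise ValueError("working set must be non-empty")
    weight_right = sum(w[i] for i in I)  # w-weight strictly right of the candidate
    weight_left = 0                      # w'-weight at or left of the candidate
    for j in I:
        weight_left += w_prime[j]
        weight_right -= w[j]
        if weight_left >= weight_right:
            return a[j]
    return a[I[-1]]  # only reached through rounding effects


# Small integer example (values chosen for illustration only).
points = [0, 1, 2, 3]
weights = [1, 1, 1, 1]
weights_prime = [1, 1, 1, 1]
working_set = [0, 1, 2, 3]
t_star = balanced_threshold(points, weights, weights_prime, working_set)
```

Comparing `gain(..., t_star)` with the gain at every other point of the working set is a cheap sanity check for small instances.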



We say that $t$ is an $S$-unificator for some $S > 0$ if

$$S = w'_{i'} (t - a_{i'}) = w_i (a_i - t)$$

holds for all $i, i' \in \{1, \ldots, m\}$ such that $a_{i'} < t$ and $a_i > t$.

Lemma 2. If $t$ is an $S$-unificator, then

$$\max_{I : |I| \le q} \; \inf_{t' \in \mathbb{R}} \Big( \sum_{i \in I : a_i < t'} w'_i (t' - a_i) + \sum_{i \in I : a_i > t'} w_i (a_i - t') \Big) \le qS$$

with equality iff there exists a working set $I$ of size $q$ that puts $t$ in a balanced
position.

Proof. If player 2 chooses threshold $t$, then the gain for player 1 is $|I| S \le qS$ (no
matter which working set $I$ was chosen). Thus, $qS$ is the largest gain player 1
can hope for. If she finds a working set $I$ of size $q$ that puts $t$ in a balanced
position, the gain $qS$ is achieved (since player 2 will find no threshold superior
to $t$). If not, player 2 can reduce the gain from $qS$ to a strictly smaller value by
shifting $t$ into the direction of the “heavier” side. □

Recall that SUBSET SUM is the following NP-complete problem:

Given non-negative integers $s_1 < \cdots < s_m \le S$, decide whether there exists an
$I \subseteq \{1, \ldots, m\}$ such that $\sum_{i \in I} s_i = S$.

We consider here a variant of SUBSET SUM where the input contains an
additional parameter $q < m$ and the question is whether there exists an
$I \subseteq \{1, \ldots, m\}$ such that $|I| = q$ and $\sum_{i \in I} s_i = S$. We refer to this problem as
“SUBSET SUM with bounded subsets”.

Lemma 3. The problem “SUBSET SUM with bounded subsets” is NP-complete. 

Proof. The problem clearly belongs to NP. NP-hardness basically follows from 
a close inspection of Karp’s [7] polynomial reduction from 3-DIMENSIONAL 
MATCHING to PARTITION. We omit the details in this abstract. □ 

Theorem 2. “Exact Working Set Selection” (viewed as a decision problem) is 
NP-complete. 

Proof. Since the optimal threshold $t$ for a given working set $I$ can be found
efficiently (and the appropriate working set can be guessed), the problem clearly
belongs to NP. We complete the proof by presenting a polynomial reduction from
“SUBSET SUM with bounded subsets” to EWSS. From the input parameters
$s_1 < \cdots < s_m \le S$ and $q$ of “SUBSET SUM with bounded subsets”, we derive
the following input parameters of EWSS:

- $a_0 = -1$ and $a_i = S / s_i$ for $i = 1, \ldots, m$.

- $(w_0, w'_0) = (0, S)$ and $(w_i, w'_i) = (s_i, S - s_i)$ for $i = 1, \ldots, m$.

- Parameter $q' = q + 1$ is considered as the bound on the size of the working
set and $B = q'S$.
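The construction of the EWSS instance can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the original proof: the function names are hypothetical, all $s_i$ are assumed to be positive (so that $S / s_i$ is defined), exact rationals are used to avoid rounding, and the points are returned in input order, so the reindexing into increasing order is left out. The sketch also checks the two properties used in the remainder of the proof: threshold 0 is an $S$-unificator, and 0 is balanced exactly when the selected $s_i$ sum to $S$.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def subset_sum_to_ewss(s: Sequence[int], S: int, q: int
                       ) -> Tuple[List[Fraction], List[int], List[int], int, int]:
    """Build (a, w, w_prime, q_prime, B); index 0 is the extra point a_0 = -1."""
    a = [Fraction(-1)] + [Fraction(S, s_i) for s_i in s]
    w = [0] + list(s)
    w_prime = [S] + [S - s_i for s_i in s]  # only w_prime[0] matters at t = 0
    q_prime = q + 1
    return a, w, w_prime, q_prime, q_prime * S


def is_unificator(a, w, w_prime, t, S) -> bool:
    """Check S = w'_i (t - a_i) left of t and S = w_i (a_i - t) right of t."""
    left_ok = all(w_prime[i] * (t - a[i]) == S for i in range(len(a)) if a[i] < t)
    right_ok = all(w[i] * (a[i] - t) == S for i in range(len(a)) if a[i] > t)
    return left_ok and right_ok


def balanced_at_zero(a, w, w_prime, I) -> bool:
    """Balance condition for a threshold 0 that is not one of the points."""
    return (sum(w_prime[i] for i in I if a[i] < 0)
            == sum(w[i] for i in I if a[i] > 0))


# Example: s = (2, 3, 5), S = 8, q = 2; the subset {3, 5} sums to 8.
a, w, w_prime, q_prime, B = subset_sum_to_ewss([2, 3, 5], 8, 2)
```

In this example the working set consisting of index 0 together with the points for 3 and 5 puts threshold 0 in a balanced position, while index 0 together with the points for 2 and 3 does not.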

Note that threshold $t = 0$ is an $S$-unificator. Thus, the largest possible gain for
player 1 is $B = q'S$. This gain can be achieved iff there exists a working set $I$ of
size $q'$ that puts threshold 0 in a balanced position. Since $0 \notin \{a_0, a_1, \ldots, a_m\}$,
0 is in a balanced position iff

$$\sum_{i \in I : a_i < 0} w'_i = \sum_{i \in I : a_i > 0} w_i . \qquad (5)$$

According to the choice of the input parameters for EWSS, condition (5) implies
that $\sum_{i \in I : a_i < 0} w'_i = w'_0 = S$ and $\sum_{i \in I : a_i > 0} w_i = \sum_{i \in I : a_i > 0} s_i$, where the latter
sum contains $q$ terms.

This discussion can be summarized as follows: player 1 can achieve gain at least
$B = q'S$ iff there exist $q$ terms from $s_1, \ldots, s_m$ that sum up to $S$. This shows
that “SUBSET SUM with bounded subsets” can be polynomially reduced to
EWSS. □

5 Complexity of Approximate Working Set Selection

Recall that KNAPSACK is the following NP-complete problem: 

Given a finite set $M$ of items, a “weight” $w(i) > 0$ and a “value” $v(i) > 0$ for
each $i \in M$, a “weight bound” $W > 0$ and a “value goal” $V > 0$, decide whether
there exists a subset $I \subseteq M$ such that

$$W_I := \sum_{i \in I} w(i) \le W \quad \text{and} \quad V_I := \sum_{i \in I} v(i) \ge V .$$

In order to view KNAPSACK as an optimization problem, we may drop the value
constraint $V_I \ge V$ and ask for a subset $I$ that leads to the largest possible total
value $V_I$ subject to the weight constraint $W_I \le W$.

We consider here a variant of KNAPSACK where the input contains an
additional parameter $q < n$ and the constraint $|I| \le q$ is put on top of the
weight constraint. We refer to this problem as “KNAPSACK with a bounded
number of items”.
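To make the problem statement concrete, the following Python sketch solves the optimization version of “KNAPSACK with a bounded number of items” exactly by dynamic programming. It is not the approximation scheme of Lemma 4: it assumes non-negative integer weights and runs in pseudo-polynomial time. The function name `knapsack_bounded` is hypothetical.

```python
from typing import List, Sequence


def knapsack_bounded(weights: Sequence[int], values: Sequence[float],
                     W: int, q: int) -> float:
    """Largest total value of at most q items whose total weight is at most W."""
    # best[c][cap]: best value using at most c of the items seen so far,
    # with total weight at most cap.
    best: List[List[float]] = [[0] * (W + 1) for _ in range(q + 1)]
    for weight, value in zip(weights, values):
        # Go through the counts downwards so that each item is used at most once:
        # row c - 1 has not yet been updated with the current item.
        for c in range(q, 0, -1):
            for cap in range(W, weight - 1, -1):
                candidate = best[c - 1][cap - weight] + value
                if candidate > best[c][cap]:
                    best[c][cap] = candidate
    return best[q][W]


# Three items with (weight, value) = (2, 3), (3, 4), (4, 5).
best_two_items = knapsack_bounded([2, 3, 4], [3, 4, 5], W=5, q=2)
best_one_item = knapsack_bounded([2, 3, 4], [3, 4, 5], W=5, q=1)
```

With two items allowed, the first two items fit the weight bound together; with only one item allowed, the single most valuable item that fits is chosen, which shows how the cardinality bound changes the optimum.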

Lemma 4. “KNAPSACK with a bounded number of items” can be solved by a 
full approximation scheme within 0{kq‘^\M\^) steps, i.e., there exists an algo- 
rithm that, on input (M,w,v,W,q,k), runs for 0{kq‘^\M\'^) steps and outputs a 
(legal) solution $I$ whose total value $V_I$ is within factor $1 + 1/k$ of the optimum.

There exists a full approximation scheme for KNAPSACK due to Ibarra and 
Kim [5] that is based on “Rounding and Scaling” . The full approximation scheme 
for “KNAPSACK with a bounded number of items” can be designed and anal- 
ysed in a similar fashion. We omit the proof of Lemma 4 in this abstract. 

Theorem 3. Any algorithm that solves “KNAPSACK with a bounded number
of items” within $T(|M|, q)$ steps and outputs a solution whose total value is
within factor $c$ of the optimum can be converted into an algorithm that solves
“Approximate Working Set Selection” within $O(q m T(m, q))$ steps and outputs a
solution that is within factor $2c$ of the optimum.

Proof. Let $I_*$ denote a working set that leads to the largest possible gain, say
$g_*$, for player 1 and $t_*$ the corresponding threshold (being in a balanced position
w.r.t. $I_*$). According to Lemma 1, we may assume without loss of generality
that $t_* \in \{a_1, \ldots, a_m\}$. Define

$$M = \{1, \ldots, m\}, \quad M' = \{i \in M \mid a_i < t_*\} \quad \text{and} \quad M'' = \{i \in M \mid a_i > t_*\}$$

and

$$I'_* = M' \cap I_* \quad \text{and} \quad I''_* = M'' \cap I_* .$$

Note that $g_* = g'_* + g''_*$ where

$$g'_* = \sum_{i \in I'_*} w'_i (t_* - a_i) \quad \text{and} \quad g''_* = \sum_{i \in I''_*} w_i (a_i - t_*) .$$

Consider the following non-deterministic algorithm:

1. Guess $j \in M$ such that $t_* = a_j$, guess $q' = |I'_*|$, and guess whether $g''_* \ge g'_*$.

2. If $g''_* \ge g'_*$, then proceed as follows:

a) Pick a set $I' \subseteq M'$ of size $q'$ such that the parameters $w'_i$ with $i \in I'$ are
the $q'$ largest weight parameters in $\{w'_i \mid i \in M'\}$. Compute the total weight
$W' = \sum_{i \in I'} w'_i$.

b) For every $i \in M''$, set $v_i = w_i (a_i - t_*)$. Intuitively, $v_i$ represents the
gain that results from putting $i$ into the working set. Apply the given
algorithm, say $A$, for “KNAPSACK with a bounded number of items”
to the following problem instance:

- $M''$ is the set of items.

- $w_i$ is the weight and $v_i$ the value of item $i \in M''$.

- $W' + w_j$ is the weight bound and $q'' = q - q'$ is the bound on the
number of items.

Let $I''$ be the set of indices that is returned by $A$.

3. If $g''_* < g'_*$, then apply the analogous procedure.

4. Output working set $I = I' \cup I''$.

Footnote: We will later remove non-determinism by means of exhaustive search.

Let’s see how to remove the initial non-deterministic guesses. In a deterministic
implementation, the algorithm performs exhaustive search instead of guessing:
it loops through all $j \in M$ and $q' \in \{0, \ldots, q\}$, and it explores both possible
values (TRUE and FALSE) of the Boolean expression $g''_* \ge g'_*$. Clearly, all but
one of the runs through these loops are based on “wrong guesses”. A wrong guess may
even lead to an inconsistency. (For example, there might be fewer than $q'$ points
left of $a_j$.) Whenever such an inconsistency is detected, the algorithm aborts
the current iteration and proceeds with the next. Each non-aborted iteration
(including the iteration with the correct guesses) leads to a candidate working
set. A candidate set leading to the highest gain for player 1 is finally chosen.
Observe now that $I''_*$ is a legal solution for the knapsack problem since it is a
subset of $M''$ of size at most $q'' = q - q'$ and

$$\sum_{i \in I''_*} w_i \le w_j + \sum_{i \in I'_*} w'_i \le w_j + W' .$$

The first inequality holds because $t_*$ is in a balanced position; the second one
holds because $W'$ is the total weight of the $q'$ “heaviest” points left of $t_*$.

Let’s now argue that the solution $I = I' \cup I''$ obtained in the iteration with
the correct guesses leads to a gain $g \ge g_* / (2c)$. Let $g'$ and $g''$ denote the gains
represented by $I'$ and $I''$, respectively (such that $g = g' + g''$). For reasons of
symmetry, we may assume that $g''_* \ge g'_*$. Then, the following holds:

$$g \ge g'' \ge \frac{1}{c} \max\Big\{ \sum_{i \in I} v_i \;\Big|\; I \subseteq M'', \; |I| \le q'', \; \sum_{i \in I} w_i \le w_j + W' \Big\} \ge \frac{1}{c} \, g''_* \ge \frac{g_*}{2c}$$

Note that $I$ is the variable running through all legal solutions for the knapsack
problem instance. The second inequality holds because, by assumption, $A$ has
performance ratio $c$. The third inequality holds because $I''_*$ is a legal solution for
the knapsack problem instance. The final inequality is valid since $g''_* \ge g'_*$.

The time bound $O(q m T(m, q))$ is obvious since there are basically $O(qm)$ calls
of the given algorithm $A$ with time bound $T$. □

Combining Lemma 4 and Theorem 3 (thereby setting $k = 1$ and $c = 1 + 1/k = 2$
for the sake of simplicity), we get

Corollary 2. There exists an algorithm for “Approximate Working Set Selec- 
tion” that runs in time 0{q^m^) and finds a working set whose gain is within 
factor 4 of the optimum. 

We briefly note the following 

Theorem 4. There exists a full approximation scheme for “Approximate Work- 
ing Set Selection”. 

The proof will be found in the full paper. 

6 Linear Time Construction of a Rate Certifying Pair 

We consider again the optimization problem $\mathcal{P}_0$ from (3). Let $x \in R(\mathcal{P}_0)$ be a
feasible solution and $x_* \in R(\mathcal{P}_0)$ an optimal feasible solution. Define

$$\sigma(x) := \sup_{x' \in R(\mathcal{P}_0)} \nabla f(x)^\top (x - x') \qquad (6)$$

$$\sigma(x|i_1, i_2) := \sup_{x' \in R(\mathcal{P}_0) : \, x'_i = x_i \text{ for } i \ne i_1, i_2} \nabla f(x)^\top (x - x') . \qquad (7)$$

As already noted by Hush and Scovel [4], the following holds:

$$f(x) - f(x_*) \le \nabla f(x)^\top (x - x_*) \le \sigma(x) . \qquad (8)$$

$(i_1, i_2)$ is called an $\alpha$-rate certifying pair for $x$ if

$$\sigma(x|i_1, i_2) \ge \alpha \big( f(x) - f(x_*) \big) . \qquad (9)$$

An $\alpha$-rate certifying algorithm is a decomposition algorithm for $\mathcal{P}_0$ that always
includes an $\alpha$-rate certifying pair for the current feasible solution in the working
set. Hush and Scovel [4] have shown that any $\alpha$-rate certifying algorithm has a
polynomial rate of convergence. Furthermore, they provided an algorithm that,
given $x$ and $\nabla f(x)$, finds a $(1/m^2)$-rate certifying pair for $x$ in $O(m \log m)$
steps. We can improve on the latter result as follows:

Theorem 5. There exists a decomposition algorithm for $\mathcal{P}_0$ that, given the current
feasible solution $x$ and $\nabla f(x)$ in the beginning of the next iteration, computes
a $(1/m^2)$-rate certifying pair for $x$ in $O(m)$ steps.



Footnote: The first inequality in (8) follows from a convexity argument; the second one is trivial.

Proof. Without loss of generality, we may assume that $\mathcal{P}_0$ is normalized, i.e., $y$
in (3) is the all-ones vector. According to (8),

$$\sigma(x|i_1, i_2) \ge \alpha \, \sigma(x) \qquad (10)$$

is a sufficient condition for (9). Chang, Hsu, and Lin [2] have shown that, for every
$x \in R(\mathcal{P}_0)$, there exists a pair $(i_1, i_2)$ that satisfies the stronger condition (10)
for $\alpha = 1/m^2$. In order to find a $(1/m^2)$-rate certifying pair for $x$, it is therefore
sufficient to construct a pair $(i_1, i_2)$ that maximizes $\sigma(x|i_1, i_2)$. To this end, we
define the quantities

$$\mu_i^+ := x_i - l_i \quad \text{and} \quad \mu_i^- := r_i - x_i .$$

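It is not hard to see that the following holds:

Claim 1. If $\nabla f(x)_{i_1} > \nabla f(x)_{i_2}$ then

$$\sigma(x|i_1, i_2) = \big( \nabla f(x)_{i_1} - \nabla f(x)_{i_2} \big) \min\{\mu_{i_1}^+, \mu_{i_2}^-\} .$$

Proof of Claim 1. According to (7), we may choose $x'$ such that $\mathbf{1}^\top x' = 0$,
$l \le x' \le r$, $x'_i = x_i$ for every $i \ne i_1, i_2$, and

$$\sigma(x|i_1, i_2) = \nabla f(x)_{i_1} (x_{i_1} - x'_{i_1}) + \nabla f(x)_{i_2} (x_{i_2} - x'_{i_2}) .$$

Recall that $x$ (as a member of $R(\mathcal{P}_0)$) also satisfies $\mathbf{1}^\top x = 0$ and $l \le x \le r$.
Thus,

$$0 = \mathbf{1}^\top (x - x') = (x_{i_1} - x'_{i_1}) + (x_{i_2} - x'_{i_2}) ,$$

which implies that

$$d := x_{i_1} - x'_{i_1} = x'_{i_2} - x_{i_2} \qquad (11)$$

and

$$\sigma(x|i_1, i_2) = d \cdot \big( \nabla f(x)_{i_1} - \nabla f(x)_{i_2} \big) .$$

Since, by assumption, $\nabla f(x)_{i_1} - \nabla f(x)_{i_2} > 0$, it follows that $x'$ was chosen
such as to maximize $d$ from (11) without violating the box-constraints. The
largest possible value for $d$ is obviously

$$d = \min\{x_{i_1} - l_{i_1}, \; r_{i_2} - x_{i_2}\} = \min\{\mu_{i_1}^+, \mu_{i_2}^-\} ,$$

which concludes the proof of Claim 1.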

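Claim 2. For $\mu \in M := \{x_i - l_i, \; r_i - x_i \mid i = 1, \ldots, m\}$, define

$$\sigma_\mu(x) = \Big( \max_{i_1 : \mu_{i_1}^+ \ge \mu} \nabla f(x)_{i_1} - \min_{i_2 : \mu_{i_2}^- \ge \mu} \nabla f(x)_{i_2} \Big) \mu .$$

Then, the following holds:

$$\max_{i_1, i_2} \sigma(x|i_1, i_2) = \max_{\mu \in M} \sigma_\mu(x) .$$

Proof of Claim 2. We first show that

$$\max_{i_1, i_2} \sigma(x|i_1, i_2) \le \max_{\mu \in M} \sigma_\mu(x) .$$

According to Claim 1, we may choose $i_1^*, i_2^*$ such that

$$0 \le \max_{i_1, i_2} \sigma(x|i_1, i_2) = \big( \nabla f(x)_{i_1^*} - \nabla f(x)_{i_2^*} \big) \min\{\mu_{i_1^*}^+, \mu_{i_2^*}^-\} .$$

Setting

$$\mu^* := \min\{\mu_{i_1^*}^+, \mu_{i_2^*}^-\} ,$$

we may proceed as follows:

$$\max_{i_1, i_2} \sigma(x|i_1, i_2) = \big( \nabla f(x)_{i_1^*} - \nabla f(x)_{i_2^*} \big) \mu^* \le \max_{\mu \in M} \Big( \max_{i_1 : \mu_{i_1}^+ \ge \mu} \nabla f(x)_{i_1} - \min_{i_2 : \mu_{i_2}^- \ge \mu} \nabla f(x)_{i_2} \Big) \mu = \max_{\mu \in M} \sigma_\mu(x) .$$

The inequality is valid because $\mu := \mu^*$ and $i_1 := i_1^*$, $i_2 := i_2^*$ are among
the possible choices for $\mu, i_1, i_2$.
The reverse inequality

$$\max_{\mu \in M} \sigma_\mu(x) \le \max_{i_1, i_2} \sigma(x|i_1, i_2)$$

can be shown in a similar fashion. Choose $\mu^*$ and $i_1^*, i_2^*$ such that

$$\mu_{i_1^*}^+ \ge \mu^* \quad \text{and} \quad \mu_{i_2^*}^- \ge \mu^*$$

and

$$\max_{\mu \in M} \sigma_\mu(x) = \big( \nabla f(x)_{i_1^*} - \nabla f(x)_{i_2^*} \big) \mu^* \le \big( \nabla f(x)_{i_1^*} - \nabla f(x)_{i_2^*} \big) \min\{\mu_{i_1^*}^+, \mu_{i_2^*}^-\} \le \max_{i_1, i_2} \sigma(x|i_1, i_2) .$$

This concludes the proof of Claim 2.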

In order to compute the pair $(i_1, i_2)$ that maximizes $\sigma(x|i_1, i_2)$, we may apply
Claim 2 and proceed as follows:

Consider the values $\mu \in M$ in decreasing order, thereby keeping track of the
index $i_1(\mu)$ that maximizes $\nabla f(x)_{i_1}$ subject to $\mu_{i_1}^+ \ge \mu$ and of the index $i_2(\mu)$
that minimizes $\nabla f(x)_{i_2}$ subject to $\mu_{i_2}^- \ge \mu$. Finally, pick the value $\mu^*$ that
maximizes $\sigma_\mu(x) = (\nabla f(x)_{i_1(\mu)} - \nabla f(x)_{i_2(\mu)}) \mu$ and output $(i_1(\mu^*), i_2(\mu^*))$.
It is not hard to see that this procedure can be implemented within $O(m)$ steps
if we maintain two lists. The first one contains the items $(i, x_i - l_i)$ and the
second one the items $(i, r_i - x_i)$. Both lists are sorted non-increasingly according
to their second component with $i$ ranging from 1 to $m$. □
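The sweep over the values in $M$ can be sketched in Python as follows. This is a hedged illustration, not the author's implementation: the function names are hypothetical, the problem is assumed to be normalized, and the two lists are sorted from scratch here, which costs $O(m \log m)$ per call; the $O(m)$ bound of the theorem relies on keeping the sorted lists across iterations. A quadratic brute-force search over all pairs, based on Claim 1, is included for comparison.

```python
from typing import Optional, Sequence, Tuple

Pair = Optional[Tuple[int, int]]


def best_pair_bruteforce(grad: Sequence[float], x: Sequence[float],
                         l: Sequence[float], r: Sequence[float]) -> float:
    """max over pairs of (grad[i1] - grad[i2]) * min(x[i1] - l[i1], r[i2] - x[i2])."""
    m = len(x)
    best = 0
    for i1 in range(m):
        for i2 in range(m):
            if i1 != i2:
                value = (grad[i1] - grad[i2]) * min(x[i1] - l[i1], r[i2] - x[i2])
                best = max(best, value)
    return best


def best_pair_sweep(grad: Sequence[float], x: Sequence[float],
                    l: Sequence[float], r: Sequence[float]) -> Tuple[float, Pair]:
    """Sweep the candidate values mu in decreasing order (Claim 2)."""
    m = len(x)
    mu_plus = [x[i] - l[i] for i in range(m)]
    mu_minus = [r[i] - x[i] for i in range(m)]
    by_plus = sorted(range(m), key=lambda i: mu_plus[i], reverse=True)
    by_minus = sorted(range(m), key=lambda i: mu_minus[i], reverse=True)
    thresholds = sorted(set(mu_plus) | set(mu_minus), reverse=True)
    p = n = 0
    i1 = i2 = None  # current maximizer / minimizer of the gradient
    best_value, best_pair = 0, None
    for mu in thresholds:
        while p < m and mu_plus[by_plus[p]] >= mu:
            if i1 is None or grad[by_plus[p]] > grad[i1]:
                i1 = by_plus[p]
            p += 1
        while n < m and mu_minus[by_minus[n]] >= mu:
            if i2 is None or grad[by_minus[n]] < grad[i2]:
                i2 = by_minus[n]
            n += 1
        if i1 is not None and i2 is not None:
            value = (grad[i1] - grad[i2]) * mu
            if value > best_value:
                best_value, best_pair = value, (i1, i2)
    return best_value, best_pair


# Small integer example (values chosen for illustration only).
example_grad = [3, 1, 2, 0]
example_x = [1, 2, 0, 3]
example_l = [0, 0, 0, 0]
example_r = [3, 3, 3, 3]
sweep_value, sweep_pair = best_pair_sweep(example_grad, example_x, example_l, example_r)
```

A pair of `None` is returned when no pair yields a positive value, which corresponds to the case where no two-variable move improves the objective.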

Footnote: Note that this data structure can be efficiently updated because two subsequent
feasible solutions differ in only two components.

Final Remarks. The time bound from Lemma 4 for the full approximation
scheme for “KNAPSACK with a bounded number of items” is polynomial, but
probably leaves room for improvement. According to Theorem 3, any improvement
on this time bound immediately leads to more efficient algorithms for
“Approximate Working Set Selection”.

The central objective of our future research is the design of policies that lead
to fast convergence to the optimum and, at the same time, can be efficiently
implemented for a wide range of optimization problems.

Abstract. The decomposition method is currently one of the major
methods for solving the convex quadratic optimization problems
associated with support vector machines. For a special case of such problems
the convergence of the decomposition method to an optimal solution
has been proven based on a working set selection via the gradient of the
objective function. In this paper we show that a generalized version
of the gradient selection approach and its associated decomposition
algorithm can be used to solve a much broader class of convex quadratic
optimization problems.



1 Introduction 

In the framework of Support-Vector-Machines (SVM) introduced by Vapnik
et al. [1] special cases of convex quadratic optimization problems have to be
solved. A popular variant of SVM is the C-Support-Vector-Classification (C-SVC)
where we try to classify $m$ given data points with binary labels $y \in \{\pm 1\}^m$.
This setting induces the following convex optimization problem:

$$\min_x f_C(x) := \frac{1}{2} x^\top Q x - e^\top x \quad \text{s.t.} \quad 0 \le x_i \le C, \; \forall i = 1, \ldots, m, \quad y^\top x = 0 , \qquad (1)$$

where $x \in \mathbb{R}^m$, $C$ is a real constant and $e$ is the $m$-dimensional vector of ones.
$Q \in \mathbb{R}^{m \times m}$ is a positive semi-definite matrix whose entries depend on the data
points and the kernel used.

In general, the matrix $Q$ is dense and therefore a huge amount of memory is
necessary to store it if traditional optimization algorithms are directly applied to
it. A solution to this problem, as proposed by Osuna et al. [4], is to decompose
the large problem into smaller ones which are solved iteratively. The key idea is to
select a working set $B^k$ in each iteration based on the currently “best” feasible
$x^k$. Then the subproblem based on the variables $x_i, i \in B^k$ is solved and the new
solution $x^{k+1}$ is updated on the selected indices while it remains unchanged on
the complement of $B^k$. This strategy has been used and refined by many other
authors [5,6,7].

* This work was supported in part by the IST Programme of the European Community,
under the PASCAL Network of Excellence, IST-2002-506778. This publication only
reflects the authors' views.

Footnote: The reader interested in more background information concerning SVM is referred
to [2,3].

A key problem in this approach is the selection of the working set $B^k$. One
widely used technique is based on the gradient of the objective function at
$x^k$ [5]. The convergence of such an approach to an optimal solution has been proven
by Lin [8]. A shortcoming of the proposed selection and the convergence proof
of the associated decomposition method in [8] is that they have only been
formulated for the special case of (1) or related problems with only one equality
constraint. Unfortunately $\nu$-SVM introduced by Schölkopf et al. [10] leads to
optimization problems with two such equality constraints (see e.g. [11]). Although
there is an extension of the gradient algorithm to the case of $\nu$-SVM [12], no
convergence proof for this case has been published.



1.1 Aim of This Paper 

There exist multiple formulations of SVM which are used in different classification
and regression problems as well as quantile estimation and novelty
detection, and which all lead to slightly different optimization problems. Nonetheless
all of them can be viewed as special cases of a general convex quadratic
optimization problem (2, see below). As the discussion concerning the decomposition
method has mainly focused on the single case of C-SVC, it may be worth study- 
ing under which circumstances the decomposition method is applicable to solve 
such general problems and when such a strategy converges to an optimal solu- 
tion. Recently Simon and List have investigated this topic. They prove a very 
general convergence theorem for the decomposition method, but for the sake of 
generality no practical selection algorithm is given [13]. 

The aim of this paper is to show that the well known method of gradient
selection as implemented e.g. in [5] can be extended to solve a far more
general class of quadratic optimization problems. In addition, the achieved
theoretical foundation, concerning the convergence and efficiency of this selection
method, is preserved by adapting Lin’s convergence theorem [8] to the
decomposition algorithm associated with the proposed general gradient selection.

We want to point out that not necessarily all cases covered by the given 
generalization arise in SVM but the class of discussed optimization problems 
subsumes all the above mentioned versions of SVM. The proposed selection 
algorithm is therefore useful for the decomposition of all such problems and 
gives a unified approach to the convergence proof of the decomposition method 
associated with this selection. 



Footnote: Platt's SMO [6] with the extension of Keerthi et al. [7] can be viewed as a special
case of that selection.

Footnote: For the special case of SMO, Keerthi and Gilbert have proven the convergence [9].

Footnote: [2,3] both give an overview.

2 Definitions and Notations



2.1 Notations 



The following naming conventions will be helpful: If $A \in \mathbb{R}^{n \times m}$ is a matrix,
$A_i, i \in \{1, \ldots, m\}$ will denote the $i$-th column. Vectors $x \in \mathbb{R}^m$ will be considered
column vectors so that their transpose $x^\top$ is a row vector. We will often deal
with a partitioning of $\{1, \ldots, m\}$ in two disjoint sets $B$ and $N$ and in this case
the notation $A_B$ will mean the matrix consisting only of columns $A_i$ with $i \in B$.
Ignoring permutation of columns we therefore write $A = [A_B \; A_N]$. The same
shall hold for vectors $x \in \mathbb{R}^m$ where $x_B$ denotes the vector consisting only of
entries $x_i$ with $i \in B$. We can therefore expand $Ax = d$ to
$[A_B \; A_N] \begin{pmatrix} x_B \\ x_N \end{pmatrix} = d$. A
matrix $Q \in \mathbb{R}^{m \times m}$ can then be decomposed into four block matrices $Q_{BB}$, $Q_{BN}$,
$Q_{NB}$ and $Q_{NN}$ accordingly.

Inequalities $l \le r$ of two vectors $l, r \in \mathbb{R}^m$ will be short for $l_i \le r_i, \forall i \in
\{1, \ldots, m\}$. In addition we will adopt the convention that the maximum over an
empty set will be $-\infty$ and the minimum $\infty$ accordingly.

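Throughout the paper, we will be concerned with the following convex quadratic
optimization problem $\mathcal{P}$:

$$\min_x f(x) := \frac{1}{2} x^\top Q x + c^\top x \quad \text{s.t.} \quad l \le x \le u, \quad Ax = d \qquad (2)$$

where $Q$ is a positive semi-definite matrix and $A \in \mathbb{R}^{n \times m}$, $l, u, x, c \in \mathbb{R}^m$ and
$d \in \mathbb{R}^n$. The feasibility region of $\mathcal{P}$ will be denoted by $\mathcal{R}(\mathcal{P})$.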



2.2 Selections, Subproblems, and Decomposition 

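Let us now define the notions used throughout this paper.

Definition 1 (Selection). Let $q < m$. A map $B : \mathbb{R}^m \to \mathfrak{P}(\{1, \ldots, m\})$
such that $|B(x)| \le q$ for any $x$ and $x$ is optimal wrt. $\mathcal{P}$ iff $B(x) = \emptyset$ is called a
($q$-)significant selection wrt. $\mathcal{P}$. (If $x$ is evident from the context we often write $B$ instead of $B(x)$.)

Definition 2 (Subproblem). For a given set $B \subseteq \{1, \ldots, m\}$ with $|B| \le q$,
$N := \{1, \ldots, m\} \setminus B$ and a given $x \in \mathcal{R}(\mathcal{P})$ we define
$f_{B, x_N}(x') := \frac{1}{2} x'^\top Q_{BB} x' + (c_B + Q_{BN} x_N)^\top x'$ for every $x' \in \mathbb{R}^{|B|}$.
The following optimization problem $\mathcal{P}_{B, x_N}$

$$\min_{x'} f_{B, x_N}(x') \quad \text{s.t.} \quad l_B \le x' \le u_B, \quad A_B x' = d - A_N x_N \qquad (3)$$

will be called the subproblem induced by $B$. (Note that $\mathcal{P}_{B, x_N}$ is a special case of $\mathcal{P}$.)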

We are now in the position to define the decomposition method formally:

Algorithm 1 (Decomposition Method). The following algorithm can be
associated with every significant selection wrt. $\mathcal{P}$:

Initialize: $k \leftarrow 0$ and $x^0 \in \mathcal{R}(\mathcal{P})$
$B \leftarrow B(x^0)$
while $B \ne \emptyset$ do
    $N \leftarrow \{1, \ldots, m\} \setminus B$
    Find $x'$ as an optimal solution of $\mathcal{P}_{B, x^k_N}$
    Set $x^{k+1}_B \leftarrow x'$ and $x^{k+1}_N \leftarrow x^k_N$
    $k \leftarrow k + 1$, $B \leftarrow B(x^k)$
end while

This algorithm shall be called the decomposition method of $\mathcal{P}$ induced by $B$.
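The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `decomposition_method`, `select` and `solve_subproblem` are hypothetical, both the selection and the subproblem solver are supplied by the caller, and the iteration cap `max_iter` is an added safeguard that is not part of Algorithm 1. The toy usage assumes a separable objective with box constraints and no equality constraints, so that each subproblem is solved by clipping.

```python
from typing import Callable, List, Sequence, Tuple


def decomposition_method(
    x0: Sequence[float],
    select: Callable[[List[float]], List[int]],
    solve_subproblem: Callable[[List[int], List[float]], List[float]],
    max_iter: int = 10_000,
) -> Tuple[List[float], int]:
    """Iterate: select a working set B, solve the subproblem, update x on B."""
    x = list(x0)
    k = 0
    B = select(x)
    while B and k < max_iter:
        x_prime = solve_subproblem(B, x)   # optimal values for the variables in B
        for i, value in zip(B, x_prime):   # entries outside B stay unchanged
            x[i] = value
        k += 1
        B = select(x)
    return x, k


# Toy instance (an assumption for illustration): minimize
# 0.5 * sum((x_i - TARGET[i])^2) subject to 0 <= x_i <= 1.
TARGET = [0.3, 1.7, -0.4, 0.9]
Q_SIZE = 2  # bound q on the working set size


def clip(v: float) -> float:
    return min(1.0, max(0.0, v))


def toy_select(x: List[float]) -> List[int]:
    """Empty iff x is optimal, so this is a significant selection for the toy problem."""
    violating = [i for i, xi in enumerate(x) if xi != clip(TARGET[i])]
    return violating[:Q_SIZE]


def toy_solve(B: List[int], x: List[float]) -> List[float]:
    return [clip(TARGET[i]) for i in B]


x_opt, iterations = decomposition_method([0.0] * 4, toy_select, toy_solve)
```

For problems with equality constraints, `solve_subproblem` would have to respect $A_B x' = d - A_N x_N$; the loop itself stays the same.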



3 Gradient Selection 

The idea of selecting the indices via an ordering according to the gradients of 
the objective function has been motivated by the idea to select indices which 
contribute most to the steepest descent in the gradient field (see [5]). We would 
like to adopt another point of view in which indices violating the KKT-conditions 
of the problem V are selected. This idea has been used e. g. in [7] and [11]. 

To motivate this we will first focus on the well-known problem of C-SVC 
(Sec. 3.1) and later enhance this strategy to a more general setting (Sec. 3.2). 



3.1 C-SVC and KKT-Violating Pairs 

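In the case of C-SVC a simple reformulation of the KKT-conditions leads to
the following optimality criterion: $x$ is optimal wrt. (1) iff there exists a $b \in \mathbb{R}$
such that for any $i \in \{1, \ldots, m\}$

$$\big( x_i > 0 \Rightarrow \nabla f(x)_i - b y_i \le 0 \big) \quad \text{and} \quad \big( x_i < C \Rightarrow \nabla f(x)_i - b y_i \ge 0 \big) . \qquad (4)$$

Following [8], (4) can be rewritten as

$$\big( i \in I_{bot}(x) \Rightarrow y_i \nabla f(x)_i \ge b \big) \quad \text{and} \quad \big( i \in I_{top}(x) \Rightarrow y_i \nabla f(x)_i \le b \big) ,$$

where

$$I_{top}(x) := \{ i \mid (x_i < C, y_i = -1) \vee (x_i > 0, y_i = 1) \}$$

$$I_{bot}(x) := \{ i \mid (x_i > 0, y_i = -1) \vee (x_i < C, y_i = 1) \} .$$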

The KKT-conditions can therefore be collapsed to a simple inequality: $x$ is
optimal iff

$$\max_{i \in I_{top}(x)} y_i \nabla f(x)_i \le \min_{i \in I_{bot}(x)} y_i \nabla f(x)_i \qquad (5)$$

Footnote: Note that if one of the sets is empty, it is easy to fulfill the KKT-conditions by
choosing an arbitrarily large or small $b$.

Given a non-optimal feasible $x$, we can now identify indices $(i, j) \in I_{top}(x) \times
I_{bot}(x)$ that satisfy the following inequality:

$$y_i \nabla f(x)_i > y_j \nabla f(x)_j$$

Such pairs do not admit the selection of a $b \in \mathbb{R}$ according to (4). Following [9],
such indices are therefore called KKT-violating pairs.

From this point of view the selection algorithm proposed by Joachims chooses
pairs of indices which violate this inequality the most. As this strategy,
implemented for example in SVM^light [5], selects the candidates in $I_{top}(x)$ from the
top of the list sorted according to the value of $y_i \nabla f(x)_i$, Lin calls indices in
$I_{top}(x)$ “top-candidates” (see [8]). Elements from $I_{bot}(x)$ are selected from the
bottom of this list and are called “bottom-candidates”. We will adopt this
naming convention here.
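The selection of a maximally KKT-violating pair for C-SVC can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of [5] or [8]: the function names are hypothetical, the gradient is passed in by the caller, and the index sets are read as $I_{top}(x) = \{i \mid (x_i < C, y_i = -1) \vee (x_i > 0, y_i = 1)\}$ and $I_{bot}(x) = \{i \mid (x_i > 0, y_i = -1) \vee (x_i < C, y_i = 1)\}$, which is the reading consistent with (4) and (5).

```python
from typing import List, Optional, Sequence, Tuple


def top_bottom_candidates(x: Sequence[float], y: Sequence[int],
                          C: float) -> Tuple[List[int], List[int]]:
    """Return (I_top, I_bot) for the C-SVC problem."""
    m = len(x)
    top = [i for i in range(m)
           if (x[i] < C and y[i] == -1) or (x[i] > 0 and y[i] == 1)]
    bot = [i for i in range(m)
           if (x[i] > 0 and y[i] == -1) or (x[i] < C and y[i] == 1)]
    return top, bot


def most_violating_pair(grad: Sequence[float], x: Sequence[float],
                        y: Sequence[int], C: float) -> Optional[Tuple[int, int]]:
    """Return (i, j) in I_top x I_bot with the largest violation of (5), or None."""
    top, bot = top_bottom_candidates(x, y, C)
    if not top or not bot:
        return None
    i = max(top, key=lambda t: y[t] * grad[t])
    j = min(bot, key=lambda b: y[b] * grad[b])
    if y[i] * grad[i] > y[j] * grad[j]:
        return i, j
    return None  # inequality (5) holds, so x is optimal


# Toy instance: Q is the 2x2 identity, so grad f(x) = x - e.
labels = [1, -1]
pair_at_start = most_violating_pair([-1.0, -1.0], [0.0, 0.0], labels, C=1.0)
pair_at_optimum = most_violating_pair([0.0, 0.0], [1.0, 1.0], labels, C=1.0)
```

At the starting point the pair violates (5), while at $x = (1, 1)$ no violating pair exists, matching the fact that this point minimizes the toy objective on the feasible region.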



3.2 General KKT Pairing 

We will now show that a pairing strategy, based on a generalized version of 
the KKT-conditions in (5), can be extended to a more general class of convex 
quadratic optimization problems. The exact condition is given in the following 
definition: 

Definition 3. Let $\mathcal{P}$ be a general convex quadratic optimization problem as
given in (2). If any selection of pairwise linearly independent columns $A_i \in \mathbb{R}^n$
of the equality constraint matrix $A$ is linearly independent, we call the problem
$\mathcal{P}$ decomposable by pairing.

Note that most variants of SVM, including $\nu$-SVM, fulfill this restriction.

We may then define an equivalence relation on the set of indices $i \in \{1, \ldots, m\}$
as follows:

$$i \sim j \iff \exists \lambda_{i,j} \in \mathbb{R} \setminus \{0\} : \lambda_{i,j} A_i = A_j \qquad (6)$$

Let $\{i_r \mid r = 1, \ldots, s\}$ be a set of representatives of this relation. The subset of
corresponding columns

$$\{a_r := A_{i_r} \mid r = 1, \ldots, s\} \subseteq \{A_i \mid i = 1, \ldots, m\}$$

therefore represents the columns of $A$ up to scalar multiplication. Additionally,
we define $\lambda_i := \lambda_{i, i_r}$ for any $i \in [i_r]$. Thus $\lambda_i A_i = a_r$ if $i \in [i_r]$.

From Definition 3 we draw two simple conclusions for such a set of
representatives: As all $\{a_r \mid r = 1, \ldots, s\}$ are pairwise linearly independent by
construction, they are linearly independent and it follows that
$s = \dim \langle a_r \mid r = 1, \ldots, s \rangle \le \operatorname{rank} A \le n$.
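The grouping of the columns of $A$ into equivalence classes, together with the factors $\lambda_i$, can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the matrix is given as a list of rows with integer or rational entries, and exact rational arithmetic is used so that proportionality can be tested without tolerances. The first column of each class serves as its representative.

```python
from fractions import Fraction
from typing import List, Optional, Sequence, Tuple


def proportionality_factor(u: Sequence[Fraction],
                           v: Sequence[Fraction]) -> Optional[Fraction]:
    """Return lambda != 0 with lambda * u == v, or None if no such factor exists."""
    pivot = next((k for k, uk in enumerate(u) if uk != 0), None)
    if pivot is None:  # u is the zero column
        return Fraction(1) if all(vk == 0 for vk in v) else None
    factor = v[pivot] / u[pivot]
    if factor != 0 and all(factor * uk == vk for uk, vk in zip(u, v)):
        return factor
    return None


def column_classes(A: Sequence[Sequence[int]]
                   ) -> Tuple[List[int], List[int], List[Fraction]]:
    """Return (representatives i_r, class index of each column, lambda_i).

    lambda_i satisfies lambda_i * A_i = a_r for the representative a_r of column i.
    """
    n, m = len(A), len(A[0])
    cols = [[Fraction(A[k][i]) for k in range(n)] for i in range(m)]
    reps: List[int] = []
    cls: List[int] = [0] * m
    lam: List[Fraction] = [Fraction(1)] * m
    for i, col in enumerate(cols):
        for r, i_r in enumerate(reps):
            factor = proportionality_factor(col, cols[i_r])
            if factor is not None:
                cls[i], lam[i] = r, factor
                break
        else:  # no existing class matches: column i starts a new class
            reps.append(i)
            cls[i] = len(reps) - 1
    return reps, cls, lam


# Columns (1,1), (2,2), (0,1): the first two are proportional.
reps, cls, lam = column_classes([[1, 2, 0], [1, 2, 1]])
```

For the C-SVC constraint $y^\top x = 0$ the matrix has the single row $y$, all columns fall into one class, and with a representative of label $+1$ the factors are $\lambda_i = y_i$.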

To formulate the central theorem of this section we define our generalized
notion of “top” and “bottom” candidates first:

Definition 4. Let $\{i_1, \ldots, i_s\}$ be a set of representatives for the equivalence
relation (6). For any $r \in \{1, \ldots, s\}$ we define:

$$I_{top,r}(x) := [i_r] \cap \{ j \mid (x_j > l_j \wedge \lambda_j > 0) \vee (x_j < u_j \wedge \lambda_j < 0) \}$$

$$I_{bot,r}(x) := [i_r] \cap \{ j \mid (x_j > l_j \wedge \lambda_j < 0) \vee (x_j < u_j \wedge \lambda_j > 0) \}$$

Indices in $I_{top,r}(x)$ ($I_{bot,r}(x)$) are called top candidates (bottom candidates).
Top candidates for which $x_i = l_i$ or $x_i = u_i$ are called top only candidates.
The notion bottom only candidate is defined accordingly. We use the following
notation:

$$\bar{I}_{top,r}(x) := \{ i \in I_{top,r}(x) \mid x_i \in \{l_i, u_i\} \} ,$$

$$\bar{I}_{bot,r}(x) := \{ i \in I_{bot,r}(x) \mid x_i \in \{l_i, u_i\} \} .$$

The following naming convention will be helpful:

$$I_{top}(x) := \bigcup_{r \in \{1, \ldots, s\}} I_{top,r}(x) .$$

$I_{bot}(x)$, $\bar{I}_{top}(x)$ and $\bar{I}_{bot}(x)$ are defined accordingly.

We will now generalize (5) to problems $\mathcal{P}$ decomposable by pairing. The
KKT-conditions of such a problem $\mathcal{P}$ say that $x$ is optimal wrt. $\mathcal{P}$ iff there
exists an $h \in \mathbb{R}^n$ such that for any $i \in \{1, \dots, m\}$

$$(x_i > l_i \Rightarrow \nabla f(x)_i - A_i^\top h \le 0) \text{ and } (x_i < u_i \Rightarrow \nabla f(x)_i - A_i^\top h \ge 0).$$

Given a set of representatives $\{i_r \mid r = 1, \dots, s\}$ and $A_i = \frac{1}{\lambda_i} a_r$ for
$i \in [i_r]$, this condition can be written as follows: $x$ is optimal wrt. $\mathcal{P}$ iff
there exists an $h \in \mathbb{R}^n$ such that for any $i \in \{1, \dots, m\}$ and $r \in \{1, \dots, s\}$

$$(i \in I_{top,r}(x) \Rightarrow \lambda_i \nabla f(x)_i \le a_r^\top h) \text{ and } (i \in I_{bot,r}(x) \Rightarrow \lambda_i \nabla f(x)_i \ge a_r^\top h) \qquad (7)$$

The following theorem will show that the KKT-conditions of such problems can
be collapsed to a simple inequality analogous to (5):

Theorem 1. If problem $\mathcal{P}$ is decomposable by pairing and a set of
representatives $\{i_1, \dots, i_s\}$ for the equivalence relation (6) is given, the
KKT-conditions can be stated as follows:

$x$ is optimal wrt. $\mathcal{P}$ iff for all $r \in \{1, \dots, s\}$

$$\max_{i \in I_{top,r}(x)} \lambda_i \nabla f(x)_i \le \min_{i \in I_{bot,r}(x)} \lambda_i \nabla f(x)_i \qquad (8)$$

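Proof. For an optimal $x$ equation (8) follows immediately from (7).

To prove the opposite direction we define
$h^r_{top} := \max_{i \in I_{top,r}(x)} \lambda_i \nabla f(x)_i$ and
$h^r_{bot} := \min_{i \in I_{bot,r}(x)} \lambda_i \nabla f(x)_i$ for any $r \in \{1, \dots, s\}$.
By assumption $h^r_{top} \le h^r_{bot}$ and we can choose $h^r \in [h^r_{top}, h^r_{bot}]$.
As $\mathcal{P}$ is decomposable by pairing, it follows that
$(a_1, \dots, a_s)^\top \in \mathbb{R}^{s \times n}$ represents a surjective linear mapping from
$\mathbb{R}^n$ to $\mathbb{R}^s$. Thus there exists an $h \in \mathbb{R}^n$ such that
$a_r^\top h = h^r$ for all $r \in \{1, \dots, s\}$. We conclude that

$$\lambda_i \nabla f(x)_i \le h^r_{top} \le h^r = a_r^\top h \quad \text{if } i \in I_{top,r}(x)$$
$$\lambda_i \nabla f(x)_i \ge h^r_{bot} \ge h^r = a_r^\top h \quad \text{if } i \in I_{bot,r}(x)$$

holds for all $i \in \{1, \dots, m\}$. Therefore the KKT-conditions (7) are satisfied and
$x$ is an optimal solution. □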
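Condition (8) is straightforward to evaluate once the classes and factors are known. The following is a minimal sketch, assuming the gradient `grad`, the box bounds `lower` and `upper`, the factors `lam` and the class labels `cls` are supplied as plain lists; the function names and the tolerance are hypothetical.

```python
def candidates(x, lower, upper, lam, cls, r):
    """Return (I_top_r, I_bot_r) of Definition 4 for class r at the point x."""
    top, bot = [], []
    for j in range(len(x)):
        if cls[j] != r:
            continue
        above_lower = x[j] > lower[j]
        below_upper = x[j] < upper[j]
        if (above_lower and lam[j] > 0) or (below_upper and lam[j] < 0):
            top.append(j)
        if (above_lower and lam[j] < 0) or (below_upper and lam[j] > 0):
            bot.append(j)
    return top, bot


def is_optimal(x, grad, lower, upper, lam, cls, tol=1e-9):
    """Check inequality (8) class by class; empty candidate sets impose no constraint."""
    for r in set(cls):
        top, bot = candidates(x, lower, upper, lam, cls, r)
        if not top or not bot:
            continue
        worst_top = max(lam[i] * grad[i] for i in top)
        best_bot = min(lam[i] * grad[i] for i in bot)
        if worst_top > best_bot + tol:
            return False
    return True


# Two free variables in one class with lambda = (1, -1)
setup = dict(lower=[0.0, 0.0], upper=[1.0, 1.0], lam=[1.0, -1.0], cls=[0, 0])
balanced = is_optimal([0.5, 0.5], [-1.0, 1.0], **setup)
violated = is_optimal([0.5, 0.5], [-1.0, 0.5], **setup)
```

Since both variables are strictly inside their bounds, each is a top and a bottom candidate, so (8) forces the two values $\lambda_i \nabla f(x)_i$ to coincide; this holds for the first gradient and fails for the second.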

3.3 Selection Algorithm 

Given a set of representatives $\{i_1, \dots, i_s\}$ and the corresponding $\lambda_i$ for all
$i \in \{1, \dots, m\}$ we are now able to formalize a generalized gradient selection
algorithm. For any $x \in \mathcal{R}(\mathcal{P})$ we call the set

$$C(x) := \{ (i,j) \in I_{top,r}(x) \times I_{bot,r}(x) \mid \lambda_i \nabla f(x)_i - \lambda_j \nabla f(x)_j > 0,\ r = 1, \dots, s \}$$

selection candidates wrt. $x$.

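Algorithm 2. Given a feasible $x$, we can calculate a $B(x) \subset \{1, \dots, m\}$ for an
even $q$ as follows:

1: Initialize: $C \leftarrow C(x)$, $B \leftarrow \emptyset$, $l \leftarrow q$.
2: while ($l > 0$) and ($C \ne \emptyset$) do
3:   Choose $(i,j) = \arg\max_{(i,j) \in C} \lambda_i \nabla f(x)_i - \lambda_j \nabla f(x)_j$
4:   $B \leftarrow B \cup \{i,j\}$.
5:   $C \leftarrow C \setminus \{i,j\}^2$, $l \leftarrow l - 2$.
6: end while
7: return $B$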

We note that the selection algorithms given in [5,8,11,12,7] can be viewed as
special cases of this algorithm. This holds as well for the extensions to $\nu$-SVM.
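The steps of Algorithm 2 can be sketched as follows. The sketch assumes the same list-based inputs as Definition 4 (gradient `grad`, bounds, factors `lam`, class labels `cls`); the function name is hypothetical. One reading of step 5 is assumed here: after a pair is chosen, every remaining candidate pair that reuses one of its two indices is dropped, so each selected index belongs to exactly one pair.

```python
def select_working_set(x, grad, lower, upper, lam, cls, q):
    """Greedy pairwise selection of at most q indices (q even)."""
    m = len(x)
    score = [lam[i] * grad[i] for i in range(m)]

    def is_top(j):
        return (x[j] > lower[j] and lam[j] > 0) or (x[j] < upper[j] and lam[j] < 0)

    def is_bot(j):
        return (x[j] > lower[j] and lam[j] < 0) or (x[j] < upper[j] and lam[j] > 0)

    # C(x): violating (top, bottom) pairs taken from the same equivalence class
    cand = [(i, j)
            for i in range(m) if is_top(i)
            for j in range(m)
            if is_bot(j) and cls[i] == cls[j] and score[i] - score[j] > 0]

    working_set, budget = set(), q
    while budget > 0 and cand:
        # step 3: most violating pair
        i, j = max(cand, key=lambda p: score[p[0]] - score[p[1]])
        working_set |= {i, j}
        # step 5 (assumed reading): discard pairs that reuse i or j
        cand = [(a, b) for a, b in cand if a not in (i, j) and b not in (i, j)]
        budget -= 2
    return working_set


args = dict(lower=[0.0] * 4, upper=[1.0] * 4, lam=[1.0, -1.0, 1.0, -1.0], cls=[0] * 4)
x0 = [0.5] * 4
first_pair = select_working_set(x0, [3.0, 1.0, -2.0, 0.5], q=2, **args)
two_pairs = select_working_set(x0, [3.0, 1.0, -2.0, 0.5], q=4, **args)
at_optimum = select_working_set(x0, [1.0, -1.0, 1.0, -1.0], q=4, **args)
```

When condition (8) holds, the candidate set is empty and the returned working set is empty as well, which matches the role of $C(x)$ as a set of KKT violations.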

4 Convergence of the Decomposition Method 

Theorem 1 implies that the mapping $B$ returned by Algorithm 2 is a significant
selection in the sense of Definition 1 and therefore induces a decomposition
method. Such a decomposition method converges to an optimal solution of $\mathcal{P}$ as
stated in the following theorem:

Theorem 2. Let $\mathcal{P}$ be a convex quadratic optimization problem decomposable
by pairing. Then any limit point $\bar{x}$ of a sequence $(x^k)_{k \in \mathbb{N}}$ of iterative
solutions of a decomposition method induced by a general gradient selection
according to Algorithm 2 is an optimal solution of $\mathcal{P}$.

The proof is given in the following sections and differs only in some technical
details from the one given in [8].

4.1 Technical Lemmata

Let us first note that $\mathcal{R}(\mathcal{P})$ is compact and therefore, for any sequence
$(x^n)_{n \in \mathbb{N}}$ of feasible solutions, such a limit point $\bar{x}$ exists. Let, in the
following, $(x^k)_{k \in K}$ be a converging subsequence such that
$\bar{x} = \lim_{k \in K, k \to \infty} x^k$.

If we assume that the matrix $Q$ satisfies the condition $\min_I \lambda_{\min}(Q_{II}) > 0$,
where $I$ ranges over all subsets of $\{1, \dots, m\}$ such that $|I| \le q$, we can prove
the following lemma (see footnote 8):

Lemma 1. Let $(x^k)_{k \in K}$ be a converging subsequence. There exists an $\alpha > 0$
such that

$$f(x^k) - f(x^{k+1}) \ge \alpha \| x^{k+1} - x^k \|^2 .$$

Proof. Let $B^k$ denote $B(x^k)$, $N^k = \{1, \dots, m\} \setminus B^k$ and $d := x^k - x^{k+1}$. Since
$f$ is a quadratic function, Taylor expansion around $x^{k+1}$ yields

$$f(x^k) = f(x^{k+1}) + \nabla f(x^{k+1})^\top d + \tfrac{1}{2} d^\top Q d . \qquad (9)$$

As $x^{k+1}$ is an optimal solution of the convex optimization problem
$\mathcal{P}_{B^k, x^k_{N^k}}$, we can conclude that the line segment $L$ between $x^k$ and $x^{k+1}$
lies in the feasibility region $\mathcal{R}(\mathcal{P}_{B^k, x^k_{N^k}})$ of the subproblem induced by
$B^k$ and therefore $f(x^{k+1}) = \min_{x \in L} f(x)$. Thus, the gradient at $x^{k+1}$ in
direction to $x^k$ is ascending, i.e.

$$\nabla f(x^{k+1})^\top d \ge 0 . \qquad (10)$$

If $\alpha := \tfrac{1}{2} \min_I \lambda_{\min}(Q_{II}) > 0$, the Courant-Fischer Minimax Theorem [15]
implies

$$d^\top Q d \ge 2 \alpha \| d \|^2 . \qquad (11)$$

From (9), (10) and (11), the lemma follows. □

Lemma 2. For any $l \in \mathbb{N}$ the sequence $(x^{k+l})_{k \in K}$ converges with limit point
$\bar{x}$ and, as $\lambda_i \nabla f(x)_i$ is continuous in $x$, $(\lambda_i \nabla f(x^{k+l})_i)_{k \in K}$ converges
accordingly with limit point $\lambda_i \nabla f(\bar{x})_i$ for all $i \in \{1, \dots, m\}$.

Proof. According to Lemma 1, $(f(x^k))_{k \in K}$ is a monotonically decreasing
sequence on a compact set and therefore converges, and we are thus able to bound
$\| x^{k+1} - \bar{x} \|$ for $k \in K$ as follows:

$$\| x^{k+1} - \bar{x} \| \le \| x^{k+1} - x^k \| + \| x^k - \bar{x} \|
  \le \sqrt{\tfrac{1}{\alpha} \bigl( f(x^k) - f(x^{k+1}) \bigr)} + \| x^k - \bar{x} \| .$$

As $f(x^k)$ is a Cauchy sequence and $x^k \to \bar{x}$ for $k \in K$, $k \to \infty$, this term
converges to zero for $k \in K$ and $k \to \infty$. Therefore $(x^{k+1})_{k \in K}$ converges with
limit $\bar{x}$. By induction the claim follows for any $l \in \mathbb{N}$. □

8 This holds if $Q$ is positive definite. With respect to problems induced by SVM this
tends to hold for small $q$ or special kernels like, for example, RBF-kernels. For the
special case of SMO ($q = 2$) Lemma 1 has been proven without this assumption [14].



Lemma 3. For any $i, j \in \{1, \dots, m\}$ such that $i \sim j$ and
$\lambda_i \nabla f(\bar{x})_i > \lambda_j \nabla f(\bar{x})_j$ and all $l \in \mathbb{N}$ there exists a $k \in K$ such that
for the next $l$ iterations $k' \in \{k, \dots, k + l\}$ the following holds:
If $i, j$ are both selected in iteration $k'$, either $i$ becomes bottom-only or $j$
becomes top-only for the next iteration $k' + 1$.

Proof. According to Lemma 2 we can find for any given $l \in \mathbb{N}$ a $k \in K$ such that
for any $k' \in \{k, \dots, k + l\}$ the following inequality holds:

$$\lambda_i \nabla f(x^{k'+1})_i > \lambda_j \nabla f(x^{k'+1})_j . \qquad (12)$$

Assume that $i, j \in B(x^{k'})$ for any $k' \in \{k, \dots, k + l\}$. As both indices are in the
working set, $x^{k'+1}$ is an optimal solution of $\mathcal{P}_{B^{k'}, x^{k'}_{N^{k'}}}$. For sake of
contradiction we assume that $i$ is top candidate and $j$ bottom candidate in
iteration $k' + 1$ at the same time, i.e. $i \in I_{top,r}(x^{k'+1})$ and $j \in I_{bot,r}(x^{k'+1})$ for
an $r \in \{1, \dots, s\}$ (see footnote 9). In this case, as $\mathcal{P}_{B^{k'}, x^{k'}_{N^{k'}}}$ is
decomposable by pairing, Theorem 1 implies

$$\lambda_i \nabla f(x^{k'+1})_i \le \lambda_j \nabla f(x^{k'+1})_j .$$

This contradicts (12) and the choice of $k'$. □



4.2 Convergence Proof 

We are now in the position to state the main proof of the convergence theorem.
For sake of contradiction, we assume there exists a limit point $\bar{x}$ which is not
optimal wrt. $\mathcal{P}$.

In the following, we will concentrate on the set
$C_> := \{ r \mid [i_r]^2 \cap C(\bar{x}) \ne \emptyset \} \subset \{1, \dots, s\}$ of equivalence classes which
contribute to the selection candidates $C(\bar{x})$ on the limit point $\bar{x}$. As $\bar{x}$ is not
optimal, $C_> \ne \emptyset$. For any such equivalence class we select the most violating
pair $(\iota_r, \kappa_r)$ as follows:

$$(\iota_r, \kappa_r) = \arg\max_{(i,j) \in C(\bar{x}) \cap [i_r]^2} \lambda_i \nabla f(\bar{x})_i - \lambda_j \nabla f(\bar{x})_j$$

Based on these pairs, we define the following two sets for any $r \in C_>$:

$$I_r := \{ i \in [i_r] \mid \lambda_i \nabla f(\bar{x})_i \ge \lambda_{\iota_r} \nabla f(\bar{x})_{\iota_r} \}$$
$$K_r := \{ i \in [i_r] \mid \lambda_i \nabla f(\bar{x})_i \le \lambda_{\kappa_r} \nabla f(\bar{x})_{\kappa_r} \}$$

For any iteration $k \in \mathbb{N}$ an index $i \in I_r \setminus \{\iota_r\}$ will then be called dominating
$\iota_r$ in iteration $k$ iff $i \in I_{top,r}(x^k)$. An index $j \in K_r \setminus \{\kappa_r\}$ will be called
dominating $\kappa_r$ iff $j \in I_{bot,r}(x^k)$ accordingly. $d^k$ will denote the number of all
dominating indices in iteration $k$. Note that, as no $i \in [i_r]$ can dominate $\iota_r$ as
well as $\kappa_r$ in one iteration, $d^k$ is bounded by $m - 2|C_>|$. We now claim that there
exists a $k \in K$ such that, for the next $m_C := m - 2|C_>| + 1$ iterations
$k' \in \{k, \dots, k + m_C\}$, the following two conditions hold: 1) $d^{k'} > 0$ and
2) $d^{k'+1} < d^{k'}$, which leads to the aspired contradiction.

9 As $i \sim j$ such an $r$ must exist.

The $k \in K$ we are looking for can be chosen, according to Lemma 2, such
that for the next $m_C + 1$ iterations all inequalities on $\bar{x}$ will be preserved, i.e.
for $k' \in \{k, \dots, k + m_C + 1\}$ the following holds:

$$\forall (i,j) \in \{1, \dots, m\}^2 : \lambda_i \nabla f(\bar{x})_i < \lambda_j \nabla f(\bar{x})_j \Rightarrow \lambda_i \nabla f(x^{k'})_i < \lambda_j \nabla f(x^{k'})_j ,$$
$$\bar{x}_i > l_i \Rightarrow x^{k'}_i > l_i \quad \text{and} \quad \bar{x}_i < u_i \Rightarrow x^{k'}_i < u_i .$$

Note that $(\iota_r, \kappa_r) \in I_{top,r}(x^{k'}) \times I_{bot,r}(x^{k'})$ for any such $k'$ and thus, due to
Lemma 3, they cannot be selected at the same time. Let us now prove the two
conditions introduced earlier for the selected $k$:

1) As $d^{k'} = 0$ would imply that for any $r \in C_>$

$$(\iota_r, \kappa_r) = \arg\max_{(i,j) \in C(x^{k'}) \cap [i_r]^2} \lambda_i \nabla f(x^{k'})_i - \lambda_j \nabla f(x^{k'})_j ,$$

at least one pair $(\iota_r, \kappa_r)$ would be selected in the next iteration $k' + 1$, which
is, by choice of $k$, not possible for any $k' \in \{k, \dots, k + m_C\}$. Therefore $d^{k'} > 0$
holds for all such $k'$ as claimed.

2) To prove that $d^{k'}$ will decrease in every iteration $k' \in \{k, \dots, k + m_C\}$ by
at least one, we have to consider two aspects: First, there have to be vanishing
dominating indices, i.e. a top candidate from $I_r$ has to become bottom-only or
a bottom candidate from $K_r$ has to become top-only. Second, we have to take
care that the number of non-dominating indices which become dominating in
the next iteration is strictly bounded by the number of vanishing dominating
indices.

Note that the state of an index $i$, concerning domination, only changes if $i$ is
selected, i.e. $i \in B(x^{k'})$. As we will only be concerned with such selected indices,
let us define the following four sets:

$$I_r^+ := I_r \cap I_{top,r}(x^{k'}) \cap B(x^{k'}) , \qquad I_r^- := I_r \setminus \{\iota_r\} \cap I_{bot,r}(x^{k'}) \cap B(x^{k'}) ,$$
$$K_r^+ := K_r \cap I_{bot,r}(x^{k'}) \cap B(x^{k'}) , \qquad K_r^- := K_r \setminus \{\kappa_r\} \cap I_{top,r}(x^{k'}) \cap B(x^{k'}) .$$

$I_r^+$ contains the selected indices dominating $\iota_r$ in the current iteration, while
$I_r^-$ contains the selected indices from $I_r \setminus \{\iota_r\}$ currently not dominating $\iota_r$.
$K_r^+$ and $K_r^-$ are defined accordingly. Let us first state a simple lemma
concerning vanishing dominating indices:

Lemma 4. In the next iteration all indices from $I_r^+$ will become bottom-only or
all indices from $K_r^+$ will become top-only.

In particular, if $I_r^+ \ne \emptyset$ and $K_r^+ = \emptyset$, $\kappa_r$ is selected and all indices from $I_r^+$
will become bottom-only. The same holds for $K_r^+ \ne \emptyset$ and $I_r^+ = \emptyset$ accordingly.

Proof. If both sets are empty there is nothing to show. Without loss of generality
we assume $I_r^+ \ne \emptyset$. In this case there have to be selected bottom candidates from
$[i_r]$, as the indices are selected pairwise.

If $K_r^+ = \emptyset$, by choice of $k$, the first selected bottom candidate has to be $\kappa_r$. As
$\kappa_r$ is a bottom candidate in the next iteration as well, the claim follows
according to Lemma 3.

If $K_r^+ \ne \emptyset$, the assumption that a pair $(i,j) \in I_r^+ \times K_r^+$ exists, such that $i$ will
be top and $j$ will be bottom candidate in the next iteration, contradicts Lemma 3. □

The next lemma will deal with indices currently non-dominating but, in the
next iteration, eventually becoming dominating:

Lemma 5. If $I_r^- \ne \emptyset$ the following two conditions hold (see footnote 10):
$|I_r^-| < |I_r^+|$ and $K_r^- = \emptyset$. The same holds for $K_r^-$ respectively.

Proof. If $I_r^- \ne \emptyset$ all selected top candidates dominate $\iota_r$ and $K_r^-$ is therefore
empty. As the indices are selected pairwise in every class, it holds that
$2|I_r^+| = |B(x^{k'}) \cap [i_r]| > 0$. In addition, $\kappa_r \in B(x^{k'}) \cap [i_r]$ and it therefore
follows that at least one selected bottom candidate is not in $I_r$ and thus
$|I_r^-| + 1 \le |I_r^+|$. □

To finalize the proof, we note that for at least one $r \in C_>$ there have to be
dominating indices, i.e. $I_r^+ \ne \emptyset$ or $K_r^+ \ne \emptyset$. Otherwise, by choice of $k$,
$(\iota_r, \kappa_r)$ would be selected for some $r$.

Thus, according to Lemma 4, the number of vanishing dominating indices is
strictly positive and, according to Lemma 5, the number of non-dominating
indices eventually becoming dominating is strictly smaller. This proves condition
2) and, with condition 1), leads to a contradiction as mentioned above.

Thus the assumption that a limit point $\bar{x}$ is not optimal has to be wrong and
the decomposition method, based on the generalized gradient selection, converges
for problems decomposable by pairing. □



5 Final Remarks and Open Problems 

We have shown that the well-known method of selecting the working set according
to the gradient of the objective function can be generalized to a larger class of
convex quadratic optimization problems. The complexity of the given extension
is equal to that of Joachims' selection algorithm except for an initialization
overhead for the calculation of the classes of indices and the $\lambda_i$. Thus
implementations like SVM^light [5] can easily be extended to solve such problems
with little extra cost.

We would like to point out that the given selection algorithm and the extended
convergence proof hold for most variants of SVM, including the $\nu$-SVM,
for which the proof of convergence of a decomposition method with a gradient
selection strategy (e.g. [12]) has not been published yet.

It would be interesting to eliminate the restriction on the matrix $Q$ (see
footnote 8) and to extend the proof in [14] to working sets with more than two
indices. A more interesting topic for future research would be the extension of
known results concerning speed of convergence (e.g. [16], [17]) to the extended
gradient selection approach proposed in this paper.

10 We briefly note that such candidates might only exist if $\iota_r$ is top-only.

Acknowledgments. Thanks to Hans Simon for pointing me to a more elegant 
formulation of Definition 3 and to a simplification in the main convergence proof. 
Thanks to Dietrich Braess for pointing out the simpler formulation of the proof 
of Lemma 1.

Abstract. Many singular learning machines such as neural networks
and mixture models are used in the information engineering field. In
spite of their wide range of applications, the mathematical foundation
for their analysis has not yet been established because of the
singularities in the parameter space.

In recent years, we developed an algebraic geometrical method that
shows the relation between the efficiency of Bayesian estimation and the
singularities. We also constructed an algorithm in which the Newton
diagram is used to search for desingularization maps. Using the Newton
diagram, we are able to reveal the exact value of the asymptotic
stochastic complexity, which is a criterion for model selection.

In this paper, we apply the method and the algorithm to a mixture of
binomial distributions and clarify its stochastic complexity. Since our
result is derived in a mathematically rigorous way, it can contribute
to the evaluation of the conventional approximations for calculating the
stochastic complexity, such as the Markov Chain Monte Carlo and
Variational Bayes methods.



1 Introduction 

In the information engineering field, many kinds of learning machines such as
neural networks, mixture models and Bayesian networks are being used. In spite
of their wide-ranging applications and learning algorithms, their mathematical
properties have not yet been clarified.

Every learning model (or learning machine) belongs to one of two categories,
identifiable or non-identifiable. A learning model is generally represented as a
probability density function $p(x|w)$, where $w$ is a parameter and $x$ is a feature
vector of data. When a machine learns from sample data, its parameter is
optimized. Thus, the parameter determines the probability distribution of the
model. If the mapping from the parameter to the distribution is one-to-one, the
model is identifiable; otherwise, it is non-identifiable.

There are many difficulties in analyzing non-identifiable models using the
conventional method. If the learning model attains the true distribution, the
parameter space contains the true parameter(s). In non-identifiable models, the
set of true parameters is not one point but an analytic set in the parameter space.
Because the set includes many singularities, the Fisher information matrices are
not positive definite. Thus, the log likelihood cannot be approximated by any
quadratic form of the parameters in the neighborhood of singularities [2], [21].
That is one of the reasons why the behavior of non-identifiable models has not
been clarified. We refer to such a model as a singular model.

In Bayesian estimation, the stochastic complexity [12], which is equal to
the free energy or the negative log marginal likelihood, is very important. Using
this observable, we can select the optimal size of the model and derive its
generalization error. It is well known that the stochastic complexity is equivalent
to BIC in identifiable (statistically regular) models [13]. However, this does not
hold in singular models.

The importance of the analysis of non-identifiable models has been
recently pointed out [9]. In some models such as mixture models, the maximum
likelihood estimator often diverges. Dacunha-Castelle and Gassiat proposed that
the asymptotic behavior of the log likelihood ratio of the maximum likelihood
method can be analyzed based on the theory of empirical processes by choosing
a locally conic parameterization [4]. Hagiwara has shown that the maximum
likelihood method makes training errors very small, but conjectured that it makes
generalization errors very large [8]. It is well known from many experiments that
Bayesian estimation is more useful than the maximum likelihood method
[1], [11].

In recent years, we have proven, based on algebraic geometry, that the
singularities in the parameter space strongly relate to the efficiency of
Bayesian estimation. This relation reveals that the stochastic complexity is
determined by the zeta function of the Kullback information from the true
distribution to the learning model and of an a priori distribution. The analysis
of the stochastic complexity reduces to finding the largest pole of the zeta
function. Using this method, we have clarified the upper bounds of the stochastic
complexities in concrete models, such as multi-layered perceptrons, mixture
models, Bayesian networks and hidden Markov models [22], [23], [24]. Though we
are actually able to analyze these singular models, it is not easy to find the
largest pole. Finding the largest pole is equivalent to finding a resolution of
singularities in the Kullback information according to the algebraic geometrical
method [14]. However, if the Kullback information satisfies a non-degenerate
condition (Definition 3 in Section 3), we can systematically derive a
desingularization based on the Newton diagram [3], [5], [6]. The problem is that
almost all singular models do not satisfy the condition. It seemed impossible to
apply the method of the Newton diagram to these models by choosing an
appropriate variable. However, we constructed an algorithm to make the Kullback
information satisfy the condition [25].

In this paper, we apply the algorithm to a real, concrete and practical model,
a mixture of binomial distributions, and reveal its stochastic complexity. The
result of this paper is not an upper bound but the exact value. This provides
a new criterion to select the optimal sized model in terms of the marginal
likelihood. Moreover, the procedure of the proof shows the advantages of the
Newton diagram method for analyzing other singular models.

2 Bayesian Learning and Stochastic Complexity

In this section, we introduce the standard framework of Bayesian estimation,
which is well known in statistical learning theory.

Let $X^n = (X_1, X_2, \dots, X_n)$ be a set of training samples that are
independently and identically distributed. The number of training samples is $n$.
These and the testing samples are taken from the true probability distribution
$q(x)$. Let $p(x|w)$ be a learning machine. The a priori probability distribution
$\varphi(w)$ is given on the set of parameters $W$. Then, the a posteriori probability
distribution is defined by

$$p(w|X^n) = \frac{1}{Z_0(X^n)} \varphi(w) \prod_{i=1}^{n} p(X_i|w),$$

where $Z_0(X^n)$ is a normalizing constant. The empirical Kullback information is
given by

$$H_n(w) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(X_i)}{p(X_i|w)} .$$

Then, $p(w|X^n)$ is rewritten as

$$p(w|X^n) = \frac{1}{Z(X^n)} \exp(-n H_n(w)) \varphi(w),$$

where the normalizing constant $Z(X^n)$ is given by

$$Z(X^n) = \int \exp(-n H_n(w)) \varphi(w) \, dw .$$

The stochastic complexity is defined by

$$F(X^n) = -\log Z(X^n).$$
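As a small worked illustration (not taken from the paper), the stochastic complexity can be computed in closed form for a one-parameter Bernoulli learner $p(x|w) = w^x (1-w)^{1-x}$ with a uniform prior on $[0, 1]$. The function name and this choice of model are assumptions made for the example; the integral of the likelihood is the Beta function $B(s+1, n-s+1)$ when $s$ of the $n$ samples equal 1.

```python
from math import lgamma, log


def stochastic_complexity(n, s, q):
    """F(X^n) = -log Z(X^n) for a Bernoulli learner with a uniform prior.

    n: number of samples, s: number of samples equal to 1,
    q: success probability of the true Bernoulli distribution.
    """
    # log of the integral of w^s (1-w)^(n-s) over [0, 1], i.e. log B(s+1, n-s+1)
    log_likelihood_integral = lgamma(s + 1) + lgamma(n - s + 1) - lgamma(n + 2)
    # empirical entropy term: -sum_i log q(X_i)
    entropy = -(s * log(q) + (n - s) * log(1 - q))
    # Z divides the likelihood integral by prod_i q(X_i), so F = -log(integral) - entropy
    return -log_likelihood_integral - entropy


f_small = stochastic_complexity(1000, 500, 0.5)
f_large = stochastic_complexity(4000, 2000, 0.5)
growth = f_large - f_small
```

For samples that are balanced like the true distribution, quadrupling $n$ increases $F$ by roughly $\tfrac{1}{2} \log 4$, which is the logarithmic growth expected of a regular model with a single parameter.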

We can select the optimal model and hyperparameters by minimizing
$-\log Z_0(X^n)$. This is equivalent to minimizing the stochastic complexity, since

$$-\log Z_0(X^n) = -\log Z(X^n) + S(X^n),$$
$$S(X^n) = -\sum_{i=1}^{n} \log q(X_i),$$

where the empirical entropy $S(X^n)$ is independent of the learners. The average
stochastic complexity $F(n)$ is defined by

$$F(n) = -E_{X^n} \left[ \log Z(X^n) \right], \qquad (1)$$

where $E_{X^n}[\cdot]$ stands for the expectation value over all sets of training samples.
The Bayesian predictive distribution $p(x|X^n)$ is given by

$$p(x|X^n) = \int p(x|w) \, p(w|X^n) \, dw .$$

The generalization error $G(n)$ is the average Kullback information from the true
distribution to the Bayesian predictive distribution,

$$G(n) = E_{X^n} \left[ \int q(x) \log \frac{q(x)}{p(x|X^n)} \, dx \right] .$$

Clarifying the behavior of $G(n)$ when the number of training samples is
sufficiently large is very important. The relation between $G(n)$ and $F(n)$ is

$$G(n) = F(n+1) - F(n). \qquad (2)$$

This relation is well known [10], [14], [20] and allows the generalization error
to be calculated from the average stochastic complexity. If $F(n)$ is obtained as

$$F(n) = \lambda \log n,$$

and the generalization error has an asymptotic expansion, it is given by

$$G(n) = \frac{\lambda}{n} + o\left(\frac{1}{n}\right) .$$

This $\lambda$ is referred to as the learning coefficient.
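The step from $F(n)$ to $G(n)$ can be checked directly. The following short derivation assumes $F(n) = \lambda \log n$ exactly, ignoring lower order terms, and applies relation (2) together with the expansion of $\log(1 + t)$ around $t = 0$:

```latex
\begin{align*}
G(n) &= F(n+1) - F(n)
      = \lambda \log \frac{n+1}{n}
      = \lambda \log\Bigl(1 + \frac{1}{n}\Bigr) \\
     &= \frac{\lambda}{n} - \frac{\lambda}{2 n^2} + \cdots
      = \frac{\lambda}{n} + o\Bigl(\frac{1}{n}\Bigr).
\end{align*}
```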

If a learning machine is an identifiable and regular statistical model, it is
proven [13] that asymptotically

$$F(n) = \frac{d}{2} \log n + \text{const.}$$

holds, where $d$ is the dimension of the parameter space $W$. However, for models
that are non-identifiable and non-regular, such as artificial neural networks and
mixture models, different results are derived [21]. We define the Kullback
information from the true distribution $q(x)$ to the learner $p(x|w)$ by

$$H(w) = \int q(x) \log \frac{q(x)}{p(x|w)} \, dx \qquad (3)$$

and assume that it is an analytic function. The asymptotic expansion of $F(n)$ is

$$F(n) = \lambda \log n - (m - 1) \log \log n + \text{const.},$$

where the rational number $(-\lambda)$ and the natural number $m$ are respectively equal
to the largest pole and its order of the zeta function of the Kullback information
and the a priori distribution [15] defined by

$$J(z) = \int H(w)^z \varphi(w) \, dw, \qquad (4)$$

where $z$ is a complex variable. The zeta function is holomorphic in the region
$\mathrm{Re}(z) > 0$, and can be analytically continued to a meromorphic function on
the entire complex plane, whose poles are all real, negative and rational numbers.
We can calculate $\lambda$ and $m$ by using a resolution of singularities in algebraic
geometry [15]. By finding the resolution map $g(\cdot)$, we can represent the Kullback
information eq. (3) as

$$H(g(u)) = u_1^{s_1} u_2^{s_2} \cdots u_d^{s_d} . \qquad (5)$$

Then, the largest pole $-\lambda$ and its order $m$ are found by integrating eq. (4).
However, it is not easy to find the largest pole since it is generally difficult to
find the resolution map $g(\cdot)$. Instead of the largest one, we can find a pole of
$J(z)$ using a partial resolution map that is given by a blow-up. It is known that
a pole determines an upper bound of $\lambda$. Thus, upper bounds can be derived in some
models such as multi-layer neural networks [16], mixture models [22], Bayesian
networks [23] and hidden Markov models [24]. In this paper, the largest pole in
a mixture of binomial distributions is clarified by using the Newton diagram.
The result shows not an upper bound but the exact asymptotic expansion of the
stochastic complexity.

3 Newton Diagram and Resolution of Singularities 

In this section, we introduce the Newton diagram and its relation to a resolution
of singularities. According to this method, we can write the analytic function
eq. (3) as an expression eq. (5) and find the largest pole of the zeta function.
Let the Taylor expansion of an analytic function $H(w)$ be

$$H(w) = \sum_{v} c_v w^v ,$$

where $w = (w_1, \dots, w_d) \in \mathbb{R}^d$, $v = (v_1, \dots, v_d) \in Q \subset \mathbb{Z}^d$ and $c_v$ is a
constant. We use the notation that

$$w^v = w_1^{v_1} w_2^{v_2} \cdots w_d^{v_d} .$$

Definition 1. The convex hull of the subset

$$\{ v + v' ;\ c_v \ne 0,\ v' \in \mathbb{R}_+^d \}$$

is referred to as the Newton diagram $\Gamma_+(H)$.

Example 1. Assume $H(w)$ is defined by

$$H(w) = w_1^5 + w_1^3 w_2^3 + w_1^2 w_2^2 + w_2^5 ,$$

where $w = (w_1, w_2)$, $v = (5,0), (3,3), (2,2), (0,5)$ and $c_v = 1$. Then, the Newton
diagram is depicted by the shaded area (Fig. 1 (a)).

For a given constant vector $a \in \mathbb{Z}^d$, we define $l(a)$ by

$$l(a) = \min \{ \langle v, a \rangle ;\ v \in \Gamma_+(H) \},$$

where $\langle \cdot, \cdot \rangle$ is the inner product, $\langle v, a \rangle = \sum_{i=1}^{d} v_i a_i$.

Fig. 1. (a) The Newton diagram, (b) the fan, (c) the subdivided fan



Definition 2. A face of $\Gamma_+(H)$ is defined by

$$\gamma(a) = \{ v \in \Gamma_+(H) ;\ \langle v, a \rangle = l(a) \}.$$

Intuitively, the face is the border of the Newton diagram.

Example 2. Under the condition of Example 1, the Newton diagram has four faces,
$\gamma_1 = [(0,5),(2,2)]$, $\gamma_2 = [(2,2),(5,0)]$, $[(0,5),(0,\infty)]$, $[(5,0),(\infty,0)]$ (Fig. 1 (a)).
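For two variables, the vertices and compact faces of the Newton diagram can be found with a lower convex hull computation. The sketch below is only an illustration for the case $d = 2$; the function names are hypothetical, and the exponents are passed as integer pairs.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def newton_vertices(exponents):
    """Vertices of the Newton diagram of a two-variable function."""
    pts = sorted(set(exponents))
    hull = []
    for p in pts:
        # keep only strict left turns: this builds the lower convex hull
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    # points where the second exponent no longer decreases lie inside v + R_+^2
    verts = [hull[0]]
    for p in hull[1:]:
        if p[1] < verts[-1][1]:
            verts.append(p)
    return verts


def compact_faces(exponents):
    """Compact faces as segments between consecutive vertices."""
    verts = newton_vertices(exponents)
    return list(zip(verts, verts[1:]))


example_1 = [(5, 0), (3, 3), (2, 2), (0, 5)]
vertices = newton_vertices(example_1)
faces = compact_faces(example_1)
```

For the exponents of Example 1 this yields the vertices $(0,5)$, $(2,2)$, $(5,0)$ and the two compact faces listed in Example 2; the exponent $(3,3)$ lies in the interior and contributes no vertex.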

Depending on a face $\gamma$, a polynomial $f_\gamma$ is defined by

$$f_\gamma(w) = \sum_{v \in \gamma} c_v w^v .$$

Definition 3. The function $H(w)$ is said to be non-degenerate if and only if

$$\left\{ w \in \mathbb{R}^d ;\ \frac{\partial f_\gamma}{\partial w_1} = \cdots = \frac{\partial f_\gamma}{\partial w_d} = 0 \right\} \subset \{ w_1 \cdots w_d = 0 \}$$

for an arbitrary compact face $\gamma$ of $\Gamma_+(H)$. Otherwise, $H(w)$ is said to be
degenerate.

Example 3. It is easy to show that the function $H(w)$ in Example 1 is non-degenerate.

Consider the dual space $P \subset \mathbb{Z}^d$ of $Q$.

Definition 4. The set of the orthogonal vectors to the faces is called a fan.

Consider the parallelogram constituted by two arbitrary vectors of the fan. If it
includes a point of $P$, add the vector from the origin to the point.

Definition 5. Subdivision of $\Gamma_+(H)$ is defined by adding the vectors until there
is no point in the parallelograms.

Example 4. The fan and the subdivided one of the Newton diagram in Example
1 are respectively depicted by Fig. 1 (b) and (c).

For a matrix $A = (a_j^i)$ and $w, u \in \mathbb{R}^d$ we define $w = u^A$ by

$$w_i = \prod_{j=1}^{d} u_j^{a_j^i} \quad (i = 1, \dots, d),$$

where a set $a_1 = (a_1^1, \dots, a_1^d), \dots, a_d = (a_d^1, \dots, a_d^d)$ is a part of the subdivided
fan. By using the above definitions, the following theorem is known, which relates
the Newton diagram to a resolution of singularities [5], [6].

Theorem 1. The map $\pi(u) : w = u^A$, where $\det A = \pm 1$, is called a real toric
modification. The map $\pi^{-1}(U_0) \to W$ for the neighborhood $U_0 \subset W$ of the origin
is a resolution of singularities if the function $H(w)$ is non-degenerate.
Example 5. According to the subdivided fan in Example 4, we can select two
vectors, $(3,2)$ and $(1,1)$. Then, the map $g(u_1, u_2)$ is defined by

$$w_1 = u_1^3 u_2 , \qquad w_2 = u_1^2 u_2 .$$

This gives an expression of eq. (5),

$$H(g(u)) = u_1^{10} u_2^{4} ( u_1^5 u_2 + u_1^5 u_2^2 + 1 + u_2 ).$$
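The substitution in Example 5 acts on exponents only: under $w_i = \prod_j u_j^{a_j^i}$ a monomial $w^v$ becomes $u$ raised to the exponents $\langle a_j, v \rangle$. A minimal sketch of this bookkeeping, with hypothetical function names, is:

```python
def substitute(exponents, vectors):
    """Exponents of u after substituting w_i = prod_j u_j ** vectors[j][i]."""
    return [tuple(sum(a_i * v_i for a_i, v_i in zip(a, v)) for a in vectors)
            for v in exponents]


def common_factor(exponents):
    """Componentwise minimum: the monomial that divides every term."""
    dim = len(exponents[0])
    return tuple(min(e[j] for e in exponents) for j in range(dim))


example_1 = [(5, 0), (3, 3), (2, 2), (0, 5)]   # exponents of H(w) in Example 1
fan_vectors = [(3, 2), (1, 1)]                 # vectors selected in Example 5

transformed = substitute(example_1, fan_vectors)
factor = common_factor(transformed)
remainder = [tuple(e[j] - factor[j] for j in range(2)) for e in transformed]
```

The common factor comes out as $u_1^{10} u_2^{4}$, and the remaining exponents $(5,1)$, $(5,2)$, $(0,0)$, $(0,1)$ correspond to the bracket $u_1^5 u_2 + u_1^5 u_2^2 + 1 + u_2$.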

We can find a resolution of singularities at the origin by considering all sets of
vectors such that $\det A = \pm 1$. Each $A$ provides local coordinates.

In order to find the largest pole of the zeta function determined by eq. (4),
the problem comes down to finding an efficient vector of the subdivided fan
in the Newton diagram. The largest pole depends on the ratio between the
Jacobian $|g'(u)|$ and the power of the common factor in $H(g(u))$. If a vector of
the subdivided fan is $a_j$ and $H(g(u))$ has the common factor $u_j^{\beta}$, it settles a
pole $-\alpha / \beta$, where $\alpha = \sum_{i=1}^{d} a_j^i$.
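The pole contributed by a single fan vector can be computed from the exponents alone, since the power $\beta$ of the common factor in $u_j$ equals $\min_v \langle v, a_j \rangle$ over the exponents $v$ of $H$. The following sketch uses exact fractions; the function name is hypothetical, and vectors with $\beta = 0$ are reported as contributing no pole.

```python
from fractions import Fraction


def candidate_pole(vector, exponents):
    """Pole -alpha/beta settled by one vector of the subdivided fan."""
    # beta: power of u_j in the common factor of H(g(u))
    beta = min(sum(a * v for a, v in zip(vector, e)) for e in exponents)
    # alpha: sum of the components of the fan vector
    alpha = sum(vector)
    if beta == 0:
        return None  # u_j does not divide H(g(u)); no pole from this vector
    return -Fraction(alpha, beta)


example_1 = [(5, 0), (3, 3), (2, 2), (0, 5)]
poles = {a: candidate_pole(a, example_1) for a in [(3, 2), (1, 1), (2, 3), (1, 0)]}
```

For the function of Example 1, the vectors $(3,2)$, $(1,1)$ and $(2,3)$ each give the pole $-1/2$, while the axis vector $(1,0)$ gives none.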

4 Main Result 



In this section, let us introduce mixtures of binomial distributions and our result.

4.1 Mixture of Binomial Distributions

A mixture of binomial distributions is formulated by

$$p(x = k | w) = \sum_{i=1}^{K} a_i \binom{N}{k} p_i^k (1 - p_i)^{N-k} \qquad (6)$$

where $N, K$ are integers such that $K < N$, $k = 0, 1, \dots, N$, $\binom{N}{k}$ is the number
of combinations of $N$ elements taken $k$ at a time, and $w = \{a_i, p_i\}_{i=1}^{K}$
is a parameter such that $0 < p_i < 1/2$, $a_i \ge 0$, and

$$a_K = 1 - \sum_{i=1}^{K-1} a_i .$$

A binomial distribution is defined by $\binom{N}{k} p^k (1 - p)^{N-k}$,
where $0 < p < 1/2$. Thus, the mixture eq. (6) has $K$ components.

This learning machine is used for gene analysis and mutational spectrum
analysis [7].
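A mixture of binomial distributions and its Kullback information from a true distribution can be evaluated numerically with the standard library. The sketch below is illustrative only: the function names and the particular parameter values are assumptions, and the true distribution is taken to be a single binomial component.

```python
from math import comb, log


def mixture_pmf(k, n_trials, weights, probs):
    """p(x = k | w) for a mixture of binomial distributions."""
    return sum(a * comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)
               for a, p in zip(weights, probs))


def kullback(n_trials, true_w, true_p, weights, probs):
    """Kullback information from the true mixture to the learner."""
    total = 0.0
    for k in range(n_trials + 1):
        q = mixture_pmf(k, n_trials, true_w, true_p)
        total += q * log(q / mixture_pmf(k, n_trials, weights, probs))
    return total


N = 5
# True distribution: one component. Learner: two components.
merged = kullback(N, [1.0], [0.3], [0.5, 0.5], [0.3, 0.3])
vanishing = kullback(N, [1.0], [0.3], [1.0, 0.0], [0.3, 0.1])
mismatch = kullback(N, [1.0], [0.3], [0.5, 0.5], [0.2, 0.4])
```

Both `merged` (two identical components) and `vanishing` (one component with zero weight and an arbitrary $p$) realize the true distribution, so the set of true parameters is not a single point; this is the kind of non-identifiability that produces the singularities discussed above.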



4.2 Main Result 

We use the notation

$$f(x) \asymp g(x),$$

where there are positive constants $C_1, C_2$ such that

$$C_1 g(x) \le f(x) \le C_2 g(x).$$

Assume that the true distribution consists of $K_0$ components,

$$q(x = k) = \sum_{i=1}^{K_0} a_i^* \binom{N}{k} (p_i^*)^k (1 - p_i^*)^{N-k} \qquad (7)$$

where $0 < p_i^* < 1/2$ are constants, $p_1^* < p_2^* < \cdots < p_{K_0}^*$, $a_i^* > 0$ and
$a_{K_0}^* = 1 - \sum_{i=1}^{K_0 - 1} a_i^*$.

Theorem 2. If the learning machine is given by eq. (6) and the true
distribution is given by eq. (7), the Kullback information eq. (3) satisfies

$$H(w) \asymp \sum_{k=1}^{N} \left( \sum_{j=1}^{K} a_j p_j^k - \sum_{j=1}^{K_0} a_j^* (p_j^*)^k \right)^2 .$$

Proof. Using the notation, we can express the Kullback information eq. (3) as

$$H(w) \asymp \sum_{k=0}^{N} \left[ p(x = k | w) - q(x = k) \right]^2 .$$



According to Fourier transform and Parseval’s identity, it is rewritten as 



N 



H{w) X ^ 






N 



^exp 



K 



27rit 

WTi' 



Ko 



X ^a,(l 



i=i 



> N 



Because of the orthogonality of |exp | , 

f 



N K 



HM ^ E 1 E - E - p*,)^-'^pT 



" f,* 



Ko 

E 

i=i 



Ko 






k=i ii=i 



i=i 



□ 



The following is the main result of this paper. 

Theorem 3. If the learning machine eq. (6) has $K + 1$ components and the
true distribution eq. (7) has $K$ components, then, for a sufficiently large natural
number $n$, the stochastic complexity satisfies the equation,

F{n) = log n + C, 

where C is a constant independent of n. 

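Proof. According to Theorem 2, the Kullback information satisfies

$$H(w) \asymp \sum_{k=1}^{N} \left\{ \sum_{j=1}^{K+1} a_j p_j^k - \sum_{j=1}^{K} a_j^* (p_j^*)^k \right\}^2 .$$

We define the map $\Theta_1 : w \mapsto w_1$, such that

$$a'_j = a_j - a^*_{j-1} \quad (j = 2, \dots, K+1),$$
$$p'_j = p_j - p^*_{j-1} \quad (j = 2, \dots, K+1),$$
$$a'_1 = a_1 ,$$
$$p'_1 = p_1 - p^*_1 ,$$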



Newton Diagram and Stochastic Complexity 



359 



in order to shift the singularities to the origin. We write these variables, $a'_j$ and
$p'_j$, as $a_j$ and $p_j$ respectively to simplify the notation. Then,

N ( K 

X 'H(wi) = X! 1 '^2(fc)(ai + 02 ) +y^Cj(fc)oj- 

fc=l [ j=3 

+ci(fc)(aipi + (o2 + a\)p2) 

+di{k){aip\ + (o2 + a\)pl) 

K+l 

+ X! + fr{k,w) 

i=2 

where 




C2(fc) = pf - 

cj{k)=pf-pf 0- = 3,---,iC), 
ci(fc) = 

di(fc)= (di(l)=0), 

d,(fc) = (j = 2,...,iC), 

/ tc-i 

dK+i{k) = A: I 1 - g* 

and fr{k,w) is the sum of the remaining terms in In order to find the 

faces, we rewrite the function as 

"H(wi) = hi{wiY + 2hi{wi)h2{wi) + h2{wif , 




where 

N ( K K+l \ 

C2{k){ai + a2) + E Cj{k)aj + dj{k)pj \ , 

fe=i [ i=3 j=2 J 

N 

h2{wi) = Y, {c\{k){aipi + {q 2 + a\)p2) 

k=l 

+di{k){aip\ + (a2 + a^)pi) + fr(k,w)j . 

The terms in /i 2 (tC 2 ) are higher order than that of hi(wi). Because the vectors 
are linearly independent, 

{(c2(fc), • • • , CK(k), d2(k),- ■ ■ , dK+i(k))}k=i, 

N N ( K 1 ^ f 1 

hiiwif X Y{<^2{kf}{ai + a2f + Y\ ^<^Akf f + E E \p]- 

k=l k=l [i=3 J k=i [ i=2 J 
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Thus, the faces in the Newton diagram of TL{wi) consist of the terms, 

(oi + 02)^, 

p] (j = 2,...,X+l). 

Because of the term, (01 + 02)^, the function "H(mi) is degenerate. We define the 
map 6>2 : — >■ UI2, 

0,2 = ^2 1 

a'j=aj (j = 1,3,4, •••, if), 

Pj=Pj (j = 

We write these variables, a' and p' as aj and pj respectively to avoid the com- 
plicated notation. Then, 'H{w\) is rewritten as 

T-L{02^ {W2)) = h-^{w2Y + ‘2h[{w2)h'2{w2) + /12(W2)^, 

where 

N ( K 

K{w 2 ) = XI S C2(fc)a2 + X Ci(fc)aj 

k=i y j=3 

K+l 

+ X + C\{k){aipi + (tt2 - oi -I- 01)^2) 

i=3 
N 

^2(^2) = X! + («2 - ai + al)pl) + /^(A:,^;)} , 

and f'_r(k, w) is the sum of the remaining terms in H(Θ_2^{-1}(w_2)). The terms in
h'_2(w_2) are of higher order than those of h'_1(w_2). According to the linear
independence of the vectors

{(c_1(k), ..., c_K(k), d_2(k), ..., d_{K+1}(k))}_{k=1}^{N},

N N ( K \ N ( K+l "I 

hi{w 2 f X X {c2(fc)^}a2 + X 1 f +51 1 55 (Pj 

k=l k=l [i=3 J k=l [i=3 J 

N 

+ X (“iPi + (“2 - ai + ++ 2 )^ • 

k=l 

Thus, the faces in the Newton diagram of H(Θ_2^{-1}(w_2)) consist of the terms,

(oipi -I- a+2)^, 

(j = 3,---,X), 

(j = 3,---,X+l). 
Because of the first of these terms, the function H(Θ_2^{-1}(w_2)) is degenerate.
Thus, we define the map Θ_3 : W_2 → W_3,

a'_j = a_j (j = 1, ..., K),

p '2 = aipi + (o2 - ai + a*)p2, 

p'_j = p_j (j = 1, 3, 4, ..., K+1).

We write these variables, a'_j and p'_j, as a_j and p_j respectively to avoid
complicated notation. Then, H(Θ_2^{-1}(w_2)) is rewritten as

H(Θ_2^{-1} Θ_3^{-1}(w_3)) = h''_1(w_3)^2 + 2 h''_1(w_3) h''_2(w_3) + h''_2(w_3)^2,



where 

N ( K K+1 

h'liws) = C2{k)a2 + ^2 Cj{k)aj + ^ dj{k)pj + Ci{k)p2 

fc=i y i=3 j=3 

— ai + a 

N 

Kiwz) = ^f”{k,w), 

k^l 

and f''_r(k, w) is the sum of the remaining terms in H(Θ_2^{-1} Θ_3^{-1}(w_3)). The terms
in h''_2(w_3) are of higher order than those of h''_1(w_3). According to the linear
independence of the vectors




{(ci(fc), • • • , CK{k), di{k),- ■ ■ , dK+i{k))}^^i, 



N N ( K 

h'iiws)^ X {c 2 {kf} 02 + XI ^ 

k —1 k=l I j — 3 



N ( K+1 



a 



'3 ' X/ 1 X/ *^o{kY 

fc=l I i =3 



k=l fc=l ^ z 1 1 / 



Thus, the faces in the Newton diagram of H(Θ_2^{-1} Θ_3^{-1}(w_3)) consist of the terms,

a\p\, 

a_j^2 (j = 2, ..., K),
p_j^2 (j = 2, ..., K+1).

Finally, the function H(Θ_2^{-1} Θ_3^{-1}(w_3)) is non-degenerate. Then, we are able to
find the efficient vector,

(a_1, a_2, ..., a_K, p_1, p_2, ..., p_{K+1}) = (0, 2, ..., 2, 1, 2, ..., 2).
According to the fan, a resolution of singularities g(u), where

u = (u_1, u_2, ..., u_K, v_1, v_2, ..., v_{K+1}),

is derived as

a_1 = u_1,
a_j = u_j v_1^2 (j = 2, ..., K),
p_1 = v_1,
p_j = v_j v_1^2 (j = 2, ..., K+1).

The Jacobian of g(u) is

|g'(u)| = v_1^{4K-2},

and the common factor of H(Θ_2^{-1} Θ_3^{-1}(g(u))) is v_1^4. Therefore, the largest pole λ'
of

∫ H(Θ_2^{-1} Θ_3^{-1}(g(u)))^z |g'(u)| du

is

λ' = K - 1/4.

The Jacobians of the maps Θ_1, Θ_2, Θ_3 are not equal to zero. Therefore, the largest
pole of eq. (4) is the same as λ'. □
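
The exponent count behind this pole can be checked by hand. The following worked
computation is only a sketch: it assumes the blow-up a_j = u_j v_1^2 (j = 2, ..., K),
p_j = v_j v_1^2 (j = 2, ..., K+1), the common factor v_1^4, and the convention that the
largest pole sits at z = -λ'.

```latex
% Assumptions: blow-up a_j = u_j v_1^2, p_j = v_j v_1^2, common factor v_1^4,
% and the largest pole written as z = -\lambda'.
\begin{align*}
|g'(u)| &= \prod_{j=2}^{K} v_1^{2} \cdot \prod_{j=2}^{K+1} v_1^{2}
         = v_1^{2(K-1) + 2K} = v_1^{4K-2}, \\
\int_0^{\varepsilon} \bigl(v_1^{4}\bigr)^{z} \, v_1^{4K-2} \, dv_1
        &= \frac{\varepsilon^{\,4z + 4K - 1}}{4z + 4K - 1}, \\
4z + 4K - 1 = 0
        &\;\Longrightarrow\; z = -\Bigl(K - \tfrac{1}{4}\Bigr),
        \qquad \lambda' = K - \tfrac{1}{4}.
\end{align*}
```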



5 Discussion and Conclusion 

First, let us summarize the advantages of the method based on the Newton diagram.
There are two advantages for the analysis of singular learning machines. (a) We
are able to ignore higher-order terms which are inside the Newton diagram.
(b) A resolution of singularities is systematically found by using the fan. In some
examples of Section 3, we focused not on the whole diagram but only on the faces and
the corresponding orthogonal vectors to find a resolution of singularities. For
instance, the term icfruf of H{w) in Example 1 does not affect the resolution 
of singularities in Example 5. In the proof of Theorem 3, we ignored f_r, f'_r, and
f''_r. This makes the analysis much simpler: we need to consider only the
terms in the faces of the Kullback information. In general, it is not easy to
find which terms determine a resolution or to derive a resolution of singularities
from all terms in the Kullback information. Moreover, the number of resolutions
grows exponentially as the dimension of the parameter increases. This is one
of the reasons why a partial resolution (blow-up) was used in the previous studies
[22], [23], [24]. Based on these advantages, we revealed not an upper bound but the
exact value of λ, the coefficient of the leading term in the stochastic complexity.
The proof shows that these advantages carry over to real, concrete, and
practical models such as the mixture. The method with the Newton
diagram is expected to be applicable to other singular models. For example, we
obtained an upper bound for general mixture models [22]. If the components are
binomial distributions, the upper bound is λ ≤ K + 1/2. Our result shows that it can
be tightened under some conditions.

Second, let us discuss the degenerate Kullback information. As stated
above, the method with the Newton diagram is useful for analyzing learning
machines with singularities. However, it assumes that the Kullback information is
non-degenerate. In fact, most Kullback informations of non-regular models
are degenerate. Whether a function is degenerate or not depends on the coordinate
system of the parameter. We therefore constructed an algorithm that makes the
Kullback information non-degenerate [25]. This means that there exist simple and
complicated degenerate functions. In the Kullback information, the fans of the
Newton diagram generally include the polynomial f(w)^2. If f(w) has a primary
expression such that f(w) = a + bc + ..., we can apply the algorithm. This is a
simple degenerate function. On the other hand, we cannot do so if the smallest
order term of f(w) is a quadratic expression. This is a complicated degenerate
function. The mixture of binomial distributions in the main theorem belongs to the
simple class. However, the Kullback information is complicated degenerate if the
number of components is general. It is still an open problem to classify singular
machines into two categories, simple and complicated degenerate functions. It is
equivalent to finding how generally applicable the algorithm is.

Finally, let us discuss the model selection problem. We clarified the stochastic
complexity of a mixture of binomial distributions using the algebraic geometrical
method and the Newton diagram. As we mentioned in Section 2, the stochastic
complexity is a criterion for model selection. In regular statistical models, it is
equivalent to the well-known BIC [13]. However, this is not the case in non-regular
models. The dimension of the parameter w is d = 2K - 1. So, the coefficient λ is
smaller than d/2. Furthermore, by using our result, the conventional approximations
used to calculate the stochastic complexity, such as Markov Chain Monte Carlo and
Variational Bayes methods, can be evaluated.
Abstract. In this paper we study the learning complexity of a vast class
of quantified formulas called Relatively Quantified Generalized Formulas.
This class of formulas is parameterized by a set of predicates, called a
basis. We give a complete classification theorem, showing that every basis
gives rise to quantified formulas that are either polynomially learnable
with equivalence queries, or not polynomially predictable with membership
queries under some cryptographic assumption. We also provide a
simple criterion distinguishing the learnable cases from the non-learnable
cases.

1 Introduction 

The problem of learning an unknown formula has been widely studied. It is well- 
known that the general problem of learning an arbitrary formula is hard, even 
in the case of propositional formulas, under the usual learning models [3,16]. 
This general non-learnability motivates the research program of distinguishing 
restricted classes of formulas that are learnable from those that are not learn- 
able. Towards this end, a number of efficiently learnable classes of propositional 
formulas have been identified, for instance [1,2,6,14]. 

An attractive framework of formula classes was studied by Schaefer [21]; 
he considered generalized formulas, defined as conjunctions of predicates, and 
considered restricted classes of generalized formulas where the predicates must 
come from a restricted predicate set, called a basis. Schaefer proved a now clas- 
sic dichotomy theorem: for any basis over a two-element (boolean) domain, the 
problem of deciding if a generalized formula over the basis is true (that is, has 
a model) is either polynomial time solvable or NP-complete. He also provided a 
simple criterion to distinguish the tractable cases from those that are not.
Generalized formulas have, since Schaefer's theorem, been heavily studied in several
different contexts [4,7,8,9,10,17,18]; they are widely acknowledged as a useful
way of obtaining restricted versions of logical formulas in order to
systematically study the distinction between tractability and intractability.^1

In the context of learning, Dalmau [12], along with Jeavons in [13], has stud- 
ied the learnability of such formula classes. In particular, Dalmau [12] proved an 
analog of Schaefer’s theorem in the setting of learnability. The case of bases over 
domains of size greater than two was considered by Dalmau and Jeavons [13]: 
they established general algebraic technology for studying learnability of gener- 
alized formulas, and gave some broad sufficient conditions for learnability. 

In this paper, we continue this line of investigation by proving a classification
theorem on the learnability of a very broad class of quantified generalized
formulas. We consider relatively quantified generalized formulas, where
quantification is allowed over any subset of the domain of a variable. We prove a full
classification theorem on such relatively quantified formulas, showing that for any
basis over a finite domain, the concept class consisting of relatively quantified
formulas over the basis is either learnable in the exact model with equivalence
queries, or non-learnable in the PAC-prediction model with membership queries.
Our classification theorem thus demonstrates an extremely sharp dichotomy on
a diverse family of concept classes: none of the concept classes characterize any
of the learning models intermediate between the two that we have just named,
such as exact learning with equivalence and membership queries, or PAC learning
(either with or without membership queries).

One of the primary tools we use to obtain our classification theorem is the 
algebraic approach to studying learnability of formulas [13]. This approach has 
demonstrated that the learnability of generalized formulas over a basis depends 
only on a set of operations called the polymorphisms of the basis. Our main 
theorem states that relatively quantified formulas over a basis are learnable if 
and only if the basis has a particular type of polymorphism which we call a gen- 
eralized majority-minority (GMM) operation. Our positive learnability results 
demonstrate that quantified formulas with a GMM polymorphism have a novel, 
compact representation which we call a signature. Our negative results use alge- 
braic techniques to show that quantified formulas without a GMM polymorphism 
must have a two-element domain that is non-learnable by the dichotomy theo- 
rem of Dalmau [12]. We believe it remarkable that our negative results, which 
can be applied to bases over domains of arbitrary size, can ultimately be derived 
from the negative results for two-element bases. 

2 Preliminaries 

2.1 Formulas 

We use [n] to denote the set containing the first n positive integers, that is,
{1, ..., n}. Let V = {x_1, x_2, ...} be a countably infinite set of variables and let
A be a finite set called the domain. A k-ary relation R (k integer) on A is a subset
of A^k, where k is called the rank or arity of the relation. The set {1, ..., k} is
referred to as the set of indices or coordinates of R.

^1 See [11] for a monograph on boolean generalized formulas.

We use the term formula in a wide sense, to mean any well-formed formula, 
formed from variables, logical connectives, parentheses, relation symbols, and 
existential and universal quantifiers. 

Definition 1 ([13]). Let A be any finite set and let Γ = {R_1, R_2, ...} be any
set of relations over A, where each R_i has rank k_i. (R_i denotes both the relation
and its symbol.)

The set of quantified generalized formulas over the basis Γ, denoted by
∀∃-Form(Γ), is the smallest set of formulas such that:

(a) For all R ∈ Γ of rank k, R(y_1, ..., y_k) ∈ ∀∃-Form(Γ) where y_i ∈ V for
1 ≤ i ≤ k.

(b) For all Φ, Ψ ∈ ∀∃-Form(Γ), Φ ∧ Ψ ∈ ∀∃-Form(Γ).

(c) For all Φ ∈ ∀∃-Form(Γ) and for all x ∈ V, ∃xΦ ∈ ∀∃-Form(Γ).

(d) For all Φ ∈ ∀∃-Form(Γ) and for all x ∈ V, ∀xΦ ∈ ∀∃-Form(Γ).

If we remove condition (d) in the previous definition we obtain a reduced class
of formulas called existentially quantified generalized formulas over the basis Γ,
denoted by ∃-Form(Γ).

In this paper we focus our attention on a related family of formulas,
called relatively quantified generalized formulas over the basis Γ, and denoted
∀∃_r-Form(Γ). In a ∀∃_r-Form(Γ) formula, every occurrence of a quantifier
Q ∈ {∃, ∀} is followed by an expression of the form (x ∈ B) where B is a
subset of A. The intended meaning of the expression (x ∈ B) is to indicate that
the quantification of x is relative to the elements from B. That is, the semantics
of ∃(x ∈ B)Φ is "there exists a b in B such that Φ holds when x is substituted
with b"; similarly, ∀(x ∈ B)Φ is interpreted as "for all b in B, Φ holds when x is
substituted with b". The relatively existentially quantified generalized formulas
∃_r-Form(Γ) are defined similarly.

It can be immediately observed that usual quantification can be simulated by
performing all quantification relative to A, and hence ∀∃-Form(Γ) is contained
in ∀∃_r-Form(Γ) for each Γ. Indeed, the relationship between ∀∃-Form(Γ) and
∀∃_r-Form(Γ) is even closer, as one can easily observe the following:

Proposition 1. Let Γ be any finite set of relations containing every unary
relation. Then ∀∃_r-Form(Γ) (∃_r-Form(Γ)) and ∀∃-Form(Γ) (∃-Form(Γ)) are
equivalent.

Proof. Clearly, ∀∃-Form(Γ) ⊆ ∀∃_r-Form(Γ). We prove the converse by induction
on the structure of a formula Φ. The base case of induction, Φ ∈ Γ, is
obvious. Now, if Φ, Ψ ∈ ∀∃-Form(Γ), then Φ ∧ Ψ ∈ ∀∃-Form(Γ); ∃(x ∈ B)Φ is
equivalent to ∃x(Φ ∧ B(x)), where B is the predicate interpreted as the unary
relation B; and ∀(x ∈ B)Φ, B = {b_1, ..., b_k}, is equivalent to

(∃x Φ ∧ b'_1(x)) ∧ ... ∧ (∃x Φ ∧ b'_k(x)),

where b'_i(x) is a predicate interpreted as the unary relation {(b_i)}. □
In our analysis the size of our formulas will be relevant. By the size of a
formula Φ, denoted by |Φ|, we mean the length of a string that encodes it (in
any reasonable encoding). Let F, G be two families of formulas. We shall say that
F is polynomially contained in G if there exists a polynomial p such that for each
formula Φ in F there exists an equivalent formula Ψ in G such that |Ψ| ≤ p(|Φ|).
If furthermore G is polynomially contained in F then we say that F and G are
polynomially equivalent. By looking at Proposition 1 with these definitions at
hand we can easily infer that for each Γ, ∀∃-Form(Γ) is polynomially contained
in ∀∃_r-Form(Γ). Furthermore, if Γ contains all unary relations then ∃-Form(Γ)
and ∃_r-Form(Γ) are polynomially equivalent.

Each formula Φ defines a relation R_Φ if we apply the usual semantics of
first-order logic, and the variables are taken in lexicographical order. More formally,
let Φ be a formula over the free variables x_{i_1}, ..., x_{i_m} where
i_1 < i_2 < ... < i_m; we define

R_Φ = {(a_1, a_2, ..., a_m) : x_{i_j} = a_j (1 ≤ j ≤ m) satisfies Φ}.

Example 1. Consider the problem of learning a boolean formula formed by a
quantified conjunction of clauses with three literals per clause. Every such
formula can be expressed as a formula in ∀∃_r-Form(Γ) with the set of logical
relations Γ = {R_0, R_1, R_2, R_3}, defined by:

R_0(x, y, z) = x ∨ y ∨ z,
R_1(x, y, z) = ¬x ∨ y ∨ z,
R_2(x, y, z) = ¬x ∨ ¬y ∨ z,
R_3(x, y, z) = ¬x ∨ ¬y ∨ ¬z.

Indeed, formulas in ∀∃_r-Form(Γ) have even a bit more expressive power
(due to the relativized quantification). As an example, let Φ be the following formula:

Φ = ∃(x_1 ∈ {0}) ∃(x_2 ∈ {1}) ∀(x_3 ∈ {0, 1}) ∃(x_4 ∈ {0, 1}) ∀(x_5 ∈ {0, 1})
    R_1(x_3, x_4, x_5) ∧ R_1(x_6, x_5, x_4) ∧ R_2(x_6, x_7, x_1) ∧ R_3(x_2, x_8, x_4)

Φ is a formula in ∀∃_r-Form(Γ) over the free variables x_6, x_7, and x_8. R_Φ
contains exactly all the assignments over these variables satisfying Φ:

R_Φ = {(0, 0, 0), (0, 1, 0), (1, 0, 0)}
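
The relation of Example 1 can be checked by brute force. The sketch below evaluates
the quantifier prefix from the outside in and collects the satisfying assignments of
the free variables x_6, x_7, x_8. The negation pattern of R_1, R_2, R_3 is an
assumption (¬x ∨ y ∨ z, ¬x ∨ ¬y ∨ z, ¬x ∨ ¬y ∨ ¬z), and the function names are
illustrative only.

```python
from itertools import product

# Clause relations over {0, 1}; the negation pattern is an assumption.
def R1(x, y, z): return (not x) or y or z                # ¬x ∨ y ∨ z
def R2(x, y, z): return (not x) or (not y) or z          # ¬x ∨ ¬y ∨ z
def R3(x, y, z): return (not x) or (not y) or (not z)    # ¬x ∨ ¬y ∨ ¬z

def matrix(x1, x2, x3, x4, x5, x6, x7, x8):
    """Quantifier-free part of the example formula."""
    return (R1(x3, x4, x5) and R1(x6, x5, x4)
            and R2(x6, x7, x1) and R3(x2, x8, x4))

def phi(x6, x7, x8):
    """Relative quantifiers become any/all over the named subsets."""
    return any(                                    # ∃(x1 ∈ {0})
        any(                                       # ∃(x2 ∈ {1})
            all(                                   # ∀(x3 ∈ {0, 1})
                any(                               # ∃(x4 ∈ {0, 1})
                    all(                           # ∀(x5 ∈ {0, 1})
                        matrix(x1, x2, x3, x4, x5, x6, x7, x8)
                        for x5 in (0, 1))
                    for x4 in (0, 1))
                for x3 in (0, 1))
            for x2 in (1,))
        for x1 in (0,))

def relation_of_phi():
    """All (x6, x7, x8) in {0, 1}^3 that satisfy the formula."""
    return {t for t in product((0, 1), repeat=3) if phi(*t)}
```

Under these assumptions, relation_of_phi() yields the three tuples listed in the
example.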

Let R be an n-ary relation on a finite set A. For a set I = {i_1, ..., i_m} ⊆
{1, ..., n}, 1 ≤ i_1 < ... < i_m ≤ n, we define the projection of a tuple 𝐚 =
(a_1, ..., a_n) over I to be the tuple pr_I 𝐚 = (a_{i_1}, ..., a_{i_m}), and the
projection of R over I to be the relation pr_I R = {pr_I 𝐚 | 𝐚 ∈ R}.
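
As a small illustration of the projection just defined, here is a sketch in Python
with relations stored as sets of tuples and index sets given 1-based, as in the text.
The function names are hypothetical.

```python
def pr_tuple(I, a):
    """Projection of tuple a onto the 1-based index set I, in increasing order."""
    return tuple(a[i - 1] for i in sorted(I))

def pr_relation(I, R):
    """Projection of relation R (a set of tuples) onto the index set I."""
    return {pr_tuple(I, a) for a in R}
```

For instance, projecting the relation {(0, 0, 0), (0, 1, 0), (1, 0, 0)} onto I = {1, 3}
drops the middle coordinate of every tuple.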
2.2 Learning Preliminaries

We use two models of learning, both of them fairly standard: the model of exact 
learning with equivalence queries defined by Angluin [1] and the model of PAC- 
prediction with membership queries as defined by Angluin and Kharitonov [3]. 
We have chosen these models because each of them represents an opposite ex- 
treme in the spectrum of the usual learning models. Consequently, by proving 
our positive results in the model of exact learning with equivalence queries and 
our negative learnability results in the model of PAC-prediction with member- 
ship queries, we obtain results that are also valid for the many learning models 
lying in between, such as the PAC model [23], exact learning with equivalence 
and membership queries [1], or the PAE model [5], all of them among the most 
studied in computational learning theory. 

In our setting a concept c from a concept class C is merely a subset of a
domain space X, paired with an encoding of c. A learning algorithm A has
the goal of identifying a target concept t. It may make any number of queries
or requests to a teacher or oracle, who is committed to answer according to
t, although not necessarily in a collaborative way. In the exact model with
equivalence queries [1] the learner supplies the oracle with a hypothesis h and
the oracle either says equivalent or returns a counterexample x ∈ h △ t. If the
hypothesis provided, h, does not belong to the concept class C then we speak
of improper equivalence queries. We say that A learns a concept class C if for
every target concept t, A halts with an output concept equivalent to t. A runs
in polynomial time if its running time is bounded by a polynomial in the size
of the representation of t and the size of the largest counterexample.

In the PAC-prediction model with membership queries [3] the oracle also holds a
probability distribution D on the space of examples X. The learner is allowed
to make two kinds of oracle calls. In a membership query the learner supplies
some x ∈ X and the oracle answers "yes" if x ∈ t and "no" otherwise. In a
request for a random classified example the oracle returns a pair (x, b) where
x is chosen according to D and b is 1 if x ∈ t and 0 otherwise. The learning
algorithm A must eventually make one request for an element to predict and
halt with an output of 1 or 0. We say that A learns a concept class C if for every
target concept t and for every probability distribution D on X, A predicts the
correct classification of the requested example with probability at least 1 - ε, for
some accuracy bound ε. We say that A runs in polynomial time if its running
time is bounded by a polynomial in the size of the representation of t, the size
of the largest counterexample, and 1/ε. We shall say that a concept class C is
polynomially learnable in any of the models if there exists a learning algorithm
A that learns C and runs in polynomial time.

We use families of formulas as concept classes. The concept represented by a
formula is the set of all assignments (over the free variables of the formula) that
satisfy the formula. We shall use throughout the paper the following observation:
when F_1 and F_2 are families of formulas such that F_1 is polynomially contained in
F_2, if F_2 is polynomially learnable in any of the previous models then so is F_1.

2.3 Polymorphisms

Every set of relations Γ on a set A has a collection of associated operations
on A, called the polymorphisms of Γ. We will use the polymorphisms of Γ
in a crucial way to study Γ. An n-ary operation f preserves an m-ary
relation R (or f is a polymorphism of R, or R is invariant under f) if, for any
𝐚_1 = (a_11, ..., a_m1), ..., 𝐚_n = (a_1n, ..., a_mn) ∈ R, the tuple f(𝐚_1, ..., 𝐚_n)
defined as (f(a_11, ..., a_1n), ..., f(a_m1, ..., a_mn)) belongs to R. If f preserves
every relation from the set Γ, we say that f is a polymorphism of Γ. The set of all
polymorphisms of Γ is denoted by Pol(Γ).

Example 2. Let R_1 be the 2-ary relation over the boolean domain A = {0, 1}
that is given by

R_1 = {(0, 1), (1, 0), (1, 1)}

That is, R_1(x, y) is equivalent to x ∨ y.

Let m : {0, 1}^3 → {0, 1} be the ternary operation on {0, 1} that returns the
majority of its arguments. That is,

m(x, y, z) = x if x = y, and z otherwise.

It is not difficult to verify that m is a polymorphism of R_1. We only need to
check that for every three (not necessarily different) tuples in R_1, for example
(0, 1), (0, 1), (1, 1), the tuple obtained by applying m component-wise, which in
our example is (m(0, 0, 1), m(1, 1, 1)) = (0, 1), belongs to R_1.

Indeed, it is easy to see that m is also a polymorphism of R_2 and R_3 where
R_2(x, y) = ¬x ∨ y and R_3(x, y) = ¬x ∨ ¬y. Consequently, m is a polymorphism of

Γ = {R_1, R_2, R_3}.
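
The check described in Example 2 is easy to mechanize. The sketch below tests
whether an operation is a polymorphism of a relation by applying it component-wise to
every choice of tuples. The encodings of R_2 and R_3 assume the readings ¬x ∨ y and
¬x ∨ ¬y, and the function names are illustrative.

```python
from itertools import product

def majority(x, y, z):
    """Ternary majority on {0, 1}, written as in Example 2."""
    return x if x == y else z

def is_polymorphism(f, arity, R):
    """True if f, applied component-wise to any `arity` tuples of R, stays in R."""
    for rows in product(R, repeat=arity):
        # zip(*rows) yields, for each coordinate, the arguments passed to f
        image = tuple(f(*column) for column in zip(*rows))
        if image not in R:
            return False
    return True

R1 = {(0, 1), (1, 0), (1, 1)}   # x ∨ y
R2 = {(0, 0), (0, 1), (1, 1)}   # ¬x ∨ y   (assumed reading)
R3 = {(0, 0), (0, 1), (1, 0)}   # ¬x ∨ ¬y  (assumed reading)
```

By contrast, the binary operation min is not a polymorphism of R1: applied to
(0, 1) and (1, 0) it produces (0, 0), which is outside R1.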

The next theorem amounts to saying that the learnability and non-learnability
of a set of formulas can be expressed in terms of polymorphisms.

Theorem 1 ([13]). Let Γ, Γ_0 be sets of relations on a finite set A with Γ_0
finite. Then:

1. If Pol(Γ) ⊆ Pol(Γ_0) and ∃-Form(Γ) is polynomially predictable with
membership queries then so is ∃-Form(Γ_0).

2. If Pol(Γ) ⊆ Pol(Γ_0) and ∀∃-Form(Γ) is polynomially predictable with
membership queries then so is ∀∃-Form(Γ_0).

In the same paper, it was shown that two types of polymorphisms guarantee
the learnability of a class of formulas. An operation f : A^k → A with k ≥ 3 is
said to be a near-unanimity operation if

f(x, y, ..., y) = ... = f(y, ..., y, x) = y

for all x, y ∈ A. In this paper, it will be convenient for us to call such an operation
generalized majority. The operation x · y^{-1} · z, where · and ^{-1} are the operations
of a certain finite group, is called a coset generating operation. Note that every
affine operation is a coset generating operation arising from an Abelian group.

Theorem 2 ([13]). Let Γ be a set of relations on a finite set A. If Γ has
a polymorphism f such that f is either a near-unanimity or coset generating
operation, then ∀∃-Form(Γ) is polynomially exactly learnable with equivalence
queries.



Example 3. Every quantified 2-CNF formula can be expressed as a ∀∃-Form(Γ)
formula where Γ is the basis of Example 2. Furthermore it is easy to observe
that the operation m, also defined in Example 2, is a near-unanimity operation.
Consequently, we can infer that ∀∃-Form(Γ), and hence the family of quantified
2-CNFs, is polynomially exactly learnable with equivalence queries.



Example 4. Let R be the solution space of a system of linear equations over a
field F. Then the operation m(x, y, z) = x - y + z is a polymorphism of R.
Indeed, let A · x = b be the system defining R, and x, y, z ∈ R. Then

A · m(x, y, z) = A · (x - y + z) = A · x - A · y + A · z = b.

In fact, the converse can also be shown: if R is invariant under m then it is the
solution space of a certain system of linear equations.

The operation x - y + z is called an affine operation. A similar operation
x · y^{-1} · z, where · and ^{-1} are operations of a group, is called a coset generating
operation. If F is a 2-element field, then m(x, y, z) = x - y + z (which also equals
x + y + z) is called a minority operation, because it satisfies the identities
m(x, x, y) = m(x, y, x) = m(y, x, x) = y.
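
Example 4 can be illustrated over the two-element field. The sketch below enumerates
the solutions of a small linear system modulo 2 and checks that they are closed under
m(x, y, z) = x - y + z. The particular system (A, b) is a hypothetical one chosen for
illustration, as are the function names.

```python
from itertools import product

def solutions(A, b):
    """All 0/1 vectors x with A·x = b over the two-element field."""
    n = len(A[0])
    return {x for x in product((0, 1), repeat=n)
            if all(sum(r * v for r, v in zip(row, x)) % 2 == rhs
                   for row, rhs in zip(A, b))}

def minority(x, y, z):
    """Affine operation x - y + z over the two-element field."""
    return (x - y + z) % 2

def closed_under(op, R):
    """True if the ternary op, applied component-wise, maps R^3 into R."""
    return all(tuple(op(*col) for col in zip(s, t, u)) in R
               for s in R for t in R for u in R)

# Hypothetical system: x1 + x2 = 1 and x2 + x3 = 0 (mod 2).
A = [[1, 1, 0], [0, 1, 1]]
b = [1, 0]
R = solutions(A, b)
```

A relation that is not an affine subspace, such as {(0, 1), (1, 0), (1, 1)}, fails
the same closure test.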

3 Main Result 

The class of operations we introduce here generalizes both near-unanimity and 
coset generating operations. 

Definition 2. An operation f : A^k → A with k odd and k ≥ 3 is a generalized
majority-minority (GMM) operation if for all a, b ∈ A, either

(1) f(x, y, ..., y, y) = f(y, x, ..., y, y) = ... = f(y, y, ..., y, x) = y, for x, y ∈ {a, b}

or

(2) f(x, y, ..., y, y) = f(y, x, ..., y, y) = ... = f(y, y, ..., y, x) = x, for x, y ∈ {a, b}.

A GMM operation f is called conservative if for any x_1, ..., x_k,
f(x_1, ..., x_k) ∈ {x_1, ..., x_k}.

Let us fix a GMM operation f on a set A. A pair a, b ∈ A is said to be a majority
pair if f on a, b satisfies (1). It is said to be a minority pair if f satisfies (2).
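
Definition 2 can be tested directly on small domains. The following sketch
classifies each pair as majority or minority and reports whether an operation is GMM.
The operation `mixed` on {0, 1, 2} is a hypothetical example constructed here (it is
not taken from the paper): it behaves as majority on {0, 1} and as minority on any
pair involving 2.

```python
def pair_type(f, k, a, b):
    """Return 'majority', 'minority', or None for the pair {a, b} under k-ary f."""
    def outputs(x, y):
        # Values of f on every tuple with a single x among k - 1 copies of y.
        return {f(*([y] * i + [x] + [y] * (k - 1 - i))) for i in range(k)}
    pairs = [(x, y) for x in (a, b) for y in (a, b)]
    if all(outputs(x, y) == {y} for x, y in pairs):
        return "majority"                      # condition (1)
    if all(outputs(x, y) == {x} for x, y in pairs):
        return "minority"                      # condition (2)
    return None

def is_gmm(f, k, A):
    """True if k is odd, k >= 3, and every pair of A satisfies (1) or (2)."""
    return k >= 3 and k % 2 == 1 and all(
        pair_type(f, k, a, b) is not None for a in A for b in A)

def majority(x, y, z):
    return x if x == y else z

def xor3(x, y, z):
    return (x + y + z) % 2

def mixed(x, y, z):
    """Hypothetical GMM operation on {0, 1, 2}."""
    if 2 not in (x, y, z):
        return majority(x, y, z)
    if x == y:
        return z
    if x == z:
        return y
    return x   # covers y == z and the all-distinct case
```

Note that a pair with a = b satisfies both conditions, so the sketch reports it as
"majority" simply because that condition is tested first.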
Theorem 3. Let A be a finite set and let Γ be a finite set of relations on A.
Then ∀∃_r-Form(Γ) is polynomially exactly learnable with (improper) equivalence
queries if and only if Γ is invariant under a conservative GMM operation.
Otherwise ∃_r-Form(Γ) is not polynomially predictable with membership queries under
the assumption that public key encryption systems secure against chosen
ciphertext attack exist.

As mentioned in Section 2, for each Γ, ∀∃-Form(Γ) is polynomially
contained in ∀∃_r-Form(Γ). Consequently, the positive learnability results
also hold for the class of generalized quantified formulas, i.e., ∀∃-Form(Γ) is
polynomially exactly learnable if Γ is invariant under a conservative GMM
operation. However, the negative learnability results do not hold in general for
the class of generalized quantified formulas. We can however extend the
negative results to ∃-Form(Γ) by imposing some assumptions on Γ. Since for each
Γ containing every unary relation, we have that ∃_r-Form(Γ) is polynomially
contained in ∃-Form(Γ), we can infer that, for every such Γ, if ∃-Form(Γ) is
polynomially predictable with membership queries, then Γ must be invariant
under a conservative GMM operation. In fact we are able to strengthen this
result by only requiring that the basis Γ contains all unary relations of cardinality
at most 2. In this case we can remove the condition of conservativity of a GMM
operation.

Theorem 4. Let A be a finite set and let Γ be a set of relations containing
all unary relations of cardinality at most 2 such that ∃-Form(Γ) is polynomially
predictable with membership queries. If there exist public key encryption systems
secure against chosen ciphertext attack then Γ must be invariant under a GMM
operation.

Thus we have

Corollary 1. Let A be a finite set and let Γ be a finite set of relations on A
containing all unary relations of cardinality at most 2. Then ∀∃-Form(Γ) is
polynomially exactly learnable with (improper) equivalence queries if and only if
Γ is invariant under a GMM operation. Otherwise ∃-Form(Γ) is not polynomially
predictable with membership queries under the assumption that public key
encryption systems secure against chosen ciphertext attack exist.

The positive learnability results are shown in Section 4. Due to space
restrictions we could not include the negative learnability results in the body of
the paper. They can be found in the full version of the paper, which can be
downloaded from http://www.tecn.upf.es/~vdalmau.



4 Positive Learnability Results 

Let Γ be a set of relations on a finite set A and let f : A^k → A be a GMM
operation, k odd. In this section we shall show that if f is a polymorphism of Γ,
then ∀∃_r-Form(Γ) is polynomially learnable in the exact model with equivalence
queries. Our learning algorithm will make use of an alternative and compact
representation for relatively quantified generalized formulas called a signature. The
equivalence queries and the output of the learning algorithm will be signatures.

This section contains three subsections. In the first subsection we introduce
signatures and we show that for every relation R expressed by a formula
in ∀∃_r-Form(Γ) there exists a succinct representation (of size bounded by a
polynomial in the arity of R) in the form of a signature. In the second subsection
we prove that signatures are polynomially evaluable, that is, given a signature
and a tuple 𝐚, we can decide in polynomial time whether 𝐚 belongs
to the relation represented by the signature. Finally, in the third subsection,
we present an algorithm that polynomially learns signatures with equivalence
queries. We want to stress that all results in this section assume that the
basis Γ under consideration has a GMM polymorphism f.



4.1 Signatures 

Our approach relies on the following result.

Lemma 1. Let Γ be a set of relations on a finite set A invariant under a GMM
operation f.

1. Every relation represented by a formula from ∀∃-Form(Γ) is also invariant
under f.

2. If furthermore f is conservative, then every relation represented by a formula
from ∀∃_r-Form(Γ) is invariant under f.

Proof. We shall only sketch the proof of (1) (the proof of (2) is similar). We
shall prove that for every formula Φ in ∀∃-Form(Γ), R_Φ is invariant under f. It
is proven by structural induction on Φ (see also [13,4]). The only not entirely
straightforward case is when Φ is obtained by means of universal quantification,
Φ = ∀xΨ. In this case we use the fact, stated in the proof of Proposition 1, that
Φ is equivalent to

(∃x Ψ ∧ a'_1(x)) ∧ ... ∧ (∃x Ψ ∧ a'_r(x)),

where {a_1, ..., a_r} are the elements of A and a'_i, 1 ≤ i ≤ r, is the unary relation
{(a_i)}. It is worth mentioning that (1) holds for every function f satisfying
f(x, x, ..., x) = x and that (2) holds for every conservative f (not merely if f is
a GMM operation). □

Consequently from now on we shall view concepts as relations invariant under
f. Signatures will be a compact representation of such relations.

In what follows, A (the domain), f (the GMM operation on A), and k (its
arity) are fixed. A signature of arity n is a triple (Pro, Rec, Wit) where

- Pro is a collection of pairs (I, 𝐚) where I is a subset of [n] of cardinality at
most k - 1 and 𝐚 is a tuple on A whose arity matches the cardinality of I.

- Rec is a collection of triples (i, a, b) where 1 ≤ i ≤ n and a, b ∈ A is a
minority pair.

- Wit is a collection of n-ary tuples on A.

Furthermore Pro, Rec and Wit must satisfy the following conditions:

- For every (I, 𝐚) ∈ Pro, Wit contains some tuple 𝐱 such that pr_I 𝐱 = 𝐚.

- For every (i, a, b) ∈ Rec, Wit contains some tuples 𝐚, 𝐛 such that
pr_{1,...,i-1} 𝐚 = pr_{1,...,i-1} 𝐛, pr_i 𝐚 = a, and pr_i 𝐛 = b.

- |Wit| ≤ |Pro| + |Rec|.

We shall denote by Sig_n the set of all signatures of arity n and by Sig the set
∪_{n>0} Sig_n.

Let R be an n-ary relation and let (Pro, Rec, Wit) be a signature. We say
that (Pro, Rec, Wit) represents R ((Pro, Rec, Wit) is a signature of R) if the three
following conditions are satisfied:

- For every 1 ≤ k' ≤ k, for every I ⊆ [n] such that |I| = k' - 1 and for every
(k' - 1)-ary tuple 𝐚 on A, (I, 𝐚) ∈ Pro if and only if 𝐚 ∈ pr_I R.

- For every 1 ≤ i ≤ n and for every minority pair a, b ∈ A, (i, a, b) ∈ Rec if and
only if there are 𝐚, 𝐛 ∈ R such that pr_{1,...,i-1} 𝐚 = pr_{1,...,i-1} 𝐛, pr_i 𝐚 = a,
and pr_i 𝐛 = b.

- Wit is contained in R.
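
The three conditions above suggest a direct, brute-force way to build a signature
from an explicitly given relation. The sketch below is illustrative only and is not
the paper's learning algorithm: it enumerates R, so it is exponential in general. It
assumes that index sets are stored as increasing 1-based tuples, that minority pairs
are tested by a caller-supplied predicate `is_minority_pair` (a hypothetical name),
and that only pairs with a ≠ b are recorded in Rec.

```python
from itertools import combinations

def signature(R, n, k, is_minority_pair):
    """Return (Pro, Rec, Wit) for an n-ary relation R given as a set of tuples."""
    tuples = sorted(R)
    pro, rec, wit = set(), set(), set()
    # Pro: all projections of R onto index sets of size at most k - 1.
    for size in range(k):
        for I in combinations(range(1, n + 1), size):
            for t in tuples:
                a = tuple(t[i - 1] for i in I)
                if (I, a) not in pro:
                    pro.add((I, a))
                    wit.add(t)           # t witnesses pr_I t = a
    # Rec: two tuples of R that share a prefix and then split on a minority pair.
    for s, t in combinations(tuples, 2):
        for i in range(1, n + 1):
            if s[:i - 1] != t[:i - 1]:
                break                    # all longer prefixes differ as well
            a, b = s[i - 1], t[i - 1]
            if a != b and is_minority_pair(a, b) and (i, a, b) not in rec:
                rec.update({(i, a, b), (i, b, a)})
                wit.update({s, t})       # one pair of tuples serves both triples
    return pro, rec, wit
```

Each new element of Pro adds at most one witness and each new pair of triples in Rec
adds at most two, so the size condition on Wit holds for this construction.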

For a set of tuples T, we denote by ⟨T⟩ the smallest relation invariant under
f and containing T. We refer to this relation as the relation generated by T.
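
The relation generated by T can be computed by a simple fixed-point iteration. The
sketch below is a naive illustration with a hypothetical function name: it repeatedly
applies f component-wise to all k-tuples of tuples found so far until nothing new
appears, which is feasible only for small examples.

```python
from itertools import product

def generated_relation(f, k, T):
    """Smallest superset of T closed under component-wise application of k-ary f."""
    R = set(T)
    while True:
        new = {tuple(f(*column) for column in zip(*rows))
               for rows in product(R, repeat=k)} - R
        if not new:
            return R
        R |= new

def majority(x, y, z):
    return x if x == y else z

def xor3(x, y, z):
    return (x + y + z) % 2
```

For T = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}, closing under the boolean minority
operation adds (1, 1, 1), whereas closing under majority adds (0, 0, 0).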

Theorem 5. Let R and S be n-ary relations on A invariant under f with signa- 
tures (Pro/f, Recfl, Witfl) and (Pros, Recs, Wits). If the signatures are identical 
then R = S. 

Proof. We start with an auxiliary statement.

Claim. If $S \subseteq R$ then $R = S$.

We shall show that $R \subseteq S$. In particular, we shall show by induction on
$i$ that $\mathrm{pr}_{\{1,\dots,i\}} R \subseteq \mathrm{pr}_{\{1,\dots,i\}} S$. Take $i \le n$. If $i \le k-1$ then the required
inclusion easily follows from $\mathrm{Pro}_R = \mathrm{Pro}_S$. So let $i \ge k$ and $R' = \mathrm{pr}_{\{1,\dots,i\}} R$,
$S' = \mathrm{pr}_{\{1,\dots,i\}} S$ and $\mathbf{a} = (a_1, \dots, a_i) \in R'$. By induction hypothesis, for some $b_i$,
the tuple $(a_1, \dots, a_{i-1}, b_i) = \mathbf{a}'$ belongs to $S'$.

We consider two cases. 

Case 1. $\{a_i, b_i\}$ is majority.

In this case we show that, for every $I \subseteq \{1, \dots, i\}$, $\mathrm{pr}_I \mathbf{a} \in \mathrm{pr}_I S'$. We show
it by induction on the cardinality $m$ of $I$. The result is true for $m \le k-1$ due to
the fact that $\mathrm{Pro}_R = \mathrm{Pro}_S$. It is also true for every set $I$ that does not contain
$i$, since $\mathbf{a}' \in S'$ certifies it. Thus let $I = \{j_1, \dots, j_m\}$ be any set of indices
$1 \le j_1 < j_2 < \cdots < j_m = i$ with $m \ge k$ and also let $S'' = \mathrm{pr}_I S'$ and $\mathbf{a}^* = \mathrm{pr}_I \mathbf{a}$. To
simplify the notation let us denote $\mathbf{a}^* = (c_1, \dots, c_m)$. By induction hypothesis,
$S''$ contains the tuples $\mathbf{d}_1 = (d_1, c_2, \dots, c_m)$, $\mathbf{d}_2 = (c_1, d_2, c_3, \dots, c_m)$, $\dots$, $\mathbf{d}_m =
(c_1, \dots, c_{m-1}, d_m)$ for some $d_1, \dots, d_m \in A$. If for some $j$, $c_j = d_j$ then we are
done. Otherwise, we can assume that $d_m = b_i$ and $c_m = a_i$ and henceforth
$\{d_m, c_m\}$ is majority. If for some $j$, the pair $\{d_j, c_j\}$ is minority then we are
done, because $f(\mathbf{d}_j, \mathbf{d}_j, \dots, \mathbf{d}_j, \mathbf{d}_m) = \mathbf{a}^*$. Otherwise, $\{d_j, c_j\}$ is majority for any
$j$. In this case we have $f(\mathbf{d}_1, \dots, \mathbf{d}_k) = \mathbf{a}^*$.

Case 2. $\{a_i, b_i\}$ is minority.

Since $\mathbf{a}' \in S'$ and $S' \subseteq R'$, we have $\mathbf{a}' \in R'$. Furthermore, since $\mathbf{a} \in R'$, we
can conclude that $(i, a_i, b_i)$ belongs to $\mathrm{Rec}_R$ and hence to $\mathrm{Rec}_S$. Consequently,
there are $c_1, \dots, c_{i-1}$ such that $\mathbf{c} = (c_1, \dots, c_{i-1}, a_i)$ and $\mathbf{c}' = (c_1, \dots, c_{i-1}, b_i)$
are in $S'$. The tuples $\mathbf{d} = (d_1, \dots, d_{i-1}, a_i) = f(\mathbf{a}', \mathbf{a}', \dots, \mathbf{a}', \mathbf{c})$ and $\mathbf{d}' =
(d_1, \dots, d_{i-1}, b_i) = f(\mathbf{a}', \mathbf{a}', \dots, \mathbf{a}', \mathbf{c}')$ belong to $S'$. If $\{c_j, a_j\}$ is majority for
a certain $j$, then $d_j = a_j$. Therefore, if $d_j \ne a_j$ then $\{d_j, a_j\}$ is minority. Thus
we have that $f(\mathbf{d}, \mathbf{d}', \dots, \mathbf{d}', \mathbf{a}') = \mathbf{a}$ belongs to $S'$ and we are done.

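Let $T = \langle \mathrm{Wit}_R \rangle$ (and $= \langle \mathrm{Wit}_S \rangle$). Observe that $(\mathrm{Pro}_S, \mathrm{Rec}_S, \mathrm{Wit}_S)$ is a signature
of $T$. Since $T \subseteq S$ and $T$ and $S$ have the same signature, we have $S = T$.
Similarly $R = T$ and we are done. □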



Consequently, a signature represents at most one relation invariant under $f$
(it is possible however that a given signature does not represent any relation).
Notice that the size of Pro and Rec (and hence of Wit) is always bounded by a
polynomial in $n$ (recall that $f$ and hence $k$ and $A$ are fixed).



4.2 Signatures Are Polynomially Evaluable 

Our learning algorithm will use signatures as its concept class. That is, both the
equivalence queries and the output of the algorithm will be signatures. Consequently,
to make sure that the output of our algorithm can be used to classify examples, we
must show that signatures are polynomially evaluable, i.e., that we can decide in
polynomial time, given a tuple $\mathbf{a}$ and a signature $s$, whether the tuple $\mathbf{a}$ belongs
to the relation represented by the signature $s$.

Let $R$ be an $n$-ary relation, let $s = (\mathrm{Pro}_R, \mathrm{Rec}_R, \mathrm{Wit}_R)$ be a signature of $R$,
and let $a_1, \dots, a_m \in A$, $m \le n$. By $R_{a_1,\dots,a_m}$ denote the relation

$$\{(b_1, \dots, b_{n-m}) \mid (a_1, \dots, a_m, b_1, \dots, b_{n-m}) \in R\}.$$

A signature of $R_{a_1,\dots,a_m}$ can be efficiently computed as the following
theorem shows.

Theorem 6. Let $R$ be an $n$-ary relation invariant under $f$ and $a_1, \dots, a_m \in A$.
There exists an algorithm that, given a signature $s = (\mathrm{Pro}_R, \mathrm{Rec}_R, \mathrm{Wit}_R)$ of $R$,
computes a signature of $R_{a_1,\dots,a_m}$ with running time polynomial in $n$.

Proof. Let $d$ be any element in the universe $A$. We shall show an algorithm that
constructs a signature $(\mathrm{Pro}_{R_d}, \mathrm{Rec}_{R_d}, \mathrm{Wit}_{R_d})$ of $R_d$.







A. Bulatov, H. Chen, and V. Dalmau 



The algorithm has two phases. In the first phase, it computes $\mathrm{Pro}_{R_d}$ and the
elements of $\mathrm{Wit}_{R_d}$ witnessing the tuples from $\mathrm{Pro}_{R_d}$. Let $I$ be any set of
$k'-1$ indices, $k' \le k$, and let $\mathbf{a}$ be a $(k'-1)$-ary tuple on $A$. Let $J = \{1\} \cup \{i+1 :
i \in I\}$ and $S = \langle \mathrm{pr}_J \mathrm{Wit}_R \rangle$; note that $S$ is a relation of arity at most $k$. It is not
difficult to see that $S$ can be computed efficiently since the maximum possible
total number of tuples in $S$ is $|A|^k$. If $(d, \mathbf{a}) \notin S$, then we do not include $(I, \mathbf{a})$
in $\mathrm{Pro}_{R_d}$. Otherwise, we can easily find a tuple $\mathbf{b}$ in the relation generated by
$\mathrm{Wit}_R$ such that $\mathrm{pr}_J \mathbf{b} = (d, \mathbf{a})$. We include $\mathrm{pr}_{\{2,\dots,n\}} \mathbf{b}$ in $\mathrm{Wit}_{R_d}$ and we include
$(I, \mathbf{a})$ in $\mathrm{Pro}_{R_d}$.

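In the second phase the algorithm computes $\mathrm{Rec}_{R_d}$ and the elements of $\mathrm{Wit}_{R_d}$
witnessing the tuples from $\mathrm{Rec}_{R_d}$. If $(i+1, a, b) \notin \mathrm{Rec}_R$ then $(i, a, b) \notin \mathrm{Rec}_{R_d}$.
Otherwise we do the following. Let $I = \{i\}$ be the set of indices containing
exactly $i$. If $(I, a) \notin \mathrm{Pro}_{R_d}$ then $(i, a, b) \notin \mathrm{Rec}_{R_d}$. Otherwise, we include $(i, a, b)$ in
$\mathrm{Rec}_{R_d}$. In order to find a couple of tuples that witness $(i, a, b)$ we do the following:
Let $\mathbf{e}' \in \mathrm{Wit}_{R_d}$ be the tuple with $a = \mathrm{pr}_I \mathbf{e}'$ that witnesses $(I, a)$. The tuple $\mathbf{e} =
(d, \mathbf{e}')$ belongs to $R$, $\mathrm{pr}_1 \mathbf{e} = d$, and $\mathrm{pr}_{i+1} \mathbf{e} = a$. Since $(i+1, a, b) \in \mathrm{Rec}_R$, there
are $\mathbf{a}', \mathbf{b}' \in \mathrm{Wit}_R \subseteq R$ such that $\mathrm{pr}_{\{1,\dots,i\}} \mathbf{a}' = \mathrm{pr}_{\{1,\dots,i\}} \mathbf{b}'$, $\mathrm{pr}_{i+1} \mathbf{a}' = a$, and $\mathrm{pr}_{i+1} \mathbf{b}' = b$. If
$\mathrm{pr}_1 \mathbf{a}' = d$ then we are done. So, we assume that $\mathrm{pr}_1 \mathbf{a}' \ne d$ and consider two cases
depending on whether $\{\mathrm{pr}_1 \mathbf{a}', d\}$ is majority or minority. If $\{\mathrm{pr}_1 \mathbf{a}', d\}$ is majority
then we set $\mathbf{a}^* = f(\mathbf{e}, \mathbf{e}, \dots, \mathbf{e}, \mathbf{a}')$ and we set $\mathbf{b}^* = f(\mathbf{e}, \mathbf{e}, \dots, \mathbf{e}, \mathbf{b}')$. It is not hard
to see that $\mathbf{a}^*, \mathbf{b}^*$ are in $R$, $\mathrm{pr}_1 \mathbf{a}^* = \mathrm{pr}_1 \mathbf{b}^* = d$ and $\mathrm{pr}_{\{1,\dots,i\}} \mathbf{a}^* = \mathrm{pr}_{\{1,\dots,i\}} \mathbf{b}^*$.
Since $\{a, b\}$ is minority, we have that $\mathrm{pr}_{i+1} \mathbf{a}^* = a$ and $\mathrm{pr}_{i+1} \mathbf{b}^* = b$. If $\{\mathrm{pr}_1 \mathbf{a}', d\}$
is minority then we set $\mathbf{a}^* = f(\mathbf{e}, \mathbf{a}', \dots, \mathbf{a}', \mathbf{a}')$ and $\mathbf{b}^* = f(\mathbf{e}, \mathbf{a}', \dots, \mathbf{a}', \mathbf{b}')$.
Finally, we include $\mathrm{pr}_{\{2,\dots,n\}} \mathbf{a}^*$ and $\mathrm{pr}_{\{2,\dots,n\}} \mathbf{b}^*$ into $\mathrm{Wit}_{R_d}$.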

It is easy to see that the running time of the algorithm is polynomial in n. 

Let us prove that the algorithm constructs the right sets. It is straightforward
to see that $\mathrm{Wit}_{R_d} \subseteq R_d$. Let us start with $\mathrm{Pro}_{R_d}$. It suffices to show that, for
every collection of indices $I$ of size $k'-1$, $k' \le k$, and any $(k'-1)$-tuple $\mathbf{a}$, the
pair $(I, \mathbf{a})$ belongs to $\mathrm{Pro}_{R_d}$ if and only if $\mathbf{a} \in \mathrm{pr}_I R_d$. To see this, first observe
that $\mathbf{a} \in \mathrm{pr}_I R_d$ if and only if $(d, \mathbf{a}) \in \mathrm{pr}_J R$. By Theorem 5, $R = \langle \mathrm{Wit}_R \rangle$. Hence,
$\mathrm{pr}_J R = \langle \mathrm{pr}_J \mathrm{Wit}_R \rangle$ and the required property follows from the construction of
$\mathrm{Pro}_{R_d}$.

Then we consider $\mathrm{Rec}_{R_d}$. In this case we have to show that there exist some
$d_1, \dots, d_{i-1}$ such that $(d_1, \dots, d_{i-1}, a), (d_1, \dots, d_{i-1}, b) \in \mathrm{pr}_{\{1,\dots,i\}} R_d$ if and only
if $(i, a, b) \in \mathrm{Rec}_{R_d}$. If such tuples exist then $(d, d_1, \dots, d_{i-1}, a), (d, d_1, \dots, d_{i-1}, b)
\in \mathrm{pr}_{\{1,\dots,i+1\}} R$ and $(i+1, a, b) \in \mathrm{Rec}_R$. Then there is $\mathbf{c} \in R_d$ such that $\mathrm{pr}_i \mathbf{c} = a$.
Therefore $(\{i\}, a) \in \mathrm{Pro}_{R_d}$ and consequently the algorithm includes $(i, a, b)$ into
$\mathrm{Rec}_{R_d}$ in phase 2. Conversely, if $(i, a, b) \in \mathrm{Rec}_{R_d}$ then, for some $d_1, \dots, d_{i-1}$, we
get $(d, d_1, \dots, d_{i-1}, a), (d, d_1, \dots, d_{i-1}, b) \in \mathrm{pr}_{\{1,\dots,i+1\}} R$ and, therefore, we have
$(d_1, \dots, d_{i-1}, a), (d_1, \dots, d_{i-1}, b) \in \mathrm{pr}_{\{1,\dots,i\}} R_d$.

By iterating the process we can obtain, for any given $a_1, \dots, a_l \in A$, a signature
of $R_{a_1,\dots,a_l} = (\cdots((R_{a_1})_{a_2})\cdots)_{a_l}$. □

By convention there exists one tuple of arity 0, denoted $\lambda$. Consequently
there are only two relations of arity 0, the empty relation $\emptyset$ and the relation
$\{\lambda\}$. This enables us to deal with signatures of relations of arity 0 in a natural
way. If $R$ is $\emptyset$ then $\mathrm{Pro}(R) = \mathrm{Rec}(R) = \mathrm{Wit}(R) = \emptyset$ whereas if $R$ is $\{\lambda\}$ then
$\mathrm{Pro}(R) = \{(\emptyset, \lambda)\}$, $\mathrm{Rec}(R) = \emptyset$, and $\mathrm{Wit}(R) = \{\lambda\}$.

Corollary 2. Let $R$ be an $n$-ary relation invariant under $f$. There exists an
algorithm that, given a signature of $R$ and a tuple $\mathbf{a} = (a_1, \dots, a_n) \in A^n$, decides
whether $\mathbf{a} \in R$ with running time polynomial in $n$.

Proof. To check if a given tuple $(a_1, \dots, a_n)$ belongs to $R$, it suffices to check
whether the set $\mathrm{Wit}(R_{a_1,\dots,a_n})$ contains $\lambda$. □
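
The idea behind the corollary (fix the coordinates one by one until a relation of arity 0 remains) can be illustrated on explicitly listed relations. This is only a sketch of the semantics, assuming relations are Python sets of tuples; it works on the relation itself rather than on a signature, so it is not the polynomial-time algorithm of Theorem 6, and the names `restrict` and `member` are hypothetical.

```python
def restrict(R, prefix):
    """R_{a_1,...,a_m}: the tuples of R that start with `prefix`,
    with that prefix removed."""
    m = len(prefix)
    return {t[m:] for t in R if t[:m] == tuple(prefix)}

def member(R, a):
    """a is in R iff the arity-0 relation R_{a_1,...,a_n} contains the
    empty tuple, i.e. it is {()} rather than the empty set."""
    return () in restrict(R, a)

R = {(0, 1, 1), (0, 0, 1), (1, 1, 0)}
```

Here the empty Python tuple `()` plays the role of the unique tuple of arity 0.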

It is important to notice here that the algorithm which certifies that signatures
are polynomially evaluable does not necessarily succeed when the signature it
receives as input does not represent a relation. However, for every
$n$-ary signature (Pro, Rec, Wit) and every $a_1, \dots, a_m \in A$ we can define the signature
$(\mathrm{Pro}_{a_1,\dots,a_m}, \mathrm{Rec}_{a_1,\dots,a_m}, \mathrm{Wit}_{a_1,\dots,a_m})$ as the output of the algorithm introduced
in the proof of Theorem 6, regardless of whether (Pro, Rec, Wit) is the signature
of a relation or not. The reason for this is that we are interested in regarding
the set of all signatures as our representation class. In this case, we will assume
that the relation represented by a signature $s$ consists of all tuples $\mathbf{a}$ for which
the algorithm with input $s$ and $\mathbf{a}$ outputs “yes”. It is easy to observe that the
relation represented by a signature (Pro, Rec, Wit) is always a subset of $\langle \mathrm{Wit} \rangle$.



Step 1.      set Pro, Rec, Wit := ∅
Step 2.      while EQ((Pro, Rec, Wit)) = 'no' do
                 let a = (a_1, ..., a_n) be the counterexample produced by EQ
Step 2.1         if for some I = {i_1, ..., i_{k'-1}}, i_1 < i_2 < ... < i_{k'-1}, k' ≤ k, the pair
                     (I, pr_I a) ∉ Pro then
Step 2.1.1           set Pro := Pro ∪ {(I, pr_I a)} and Wit := Wit ∪ {a}
Step 2.2         else
Step 2.2.1           for each 0 ≤ i ≤ n compute (Pro_{a_1,...,a_i}, Rec_{a_1,...,a_i}, Wit_{a_1,...,a_i});
                     let i be the smallest integer such that Wit_{a_1,...,a_i} = ∅
                     (such i must exist if a does not belong
                     to the relation represented by (Pro, Rec, Wit))
Step 2.2.2           pick an element b = (b_i, b_{i+1}, ..., b_n) ∈ Wit_{a_1,...,a_{i-1}}
                     ({a_i, b_i} must be a minority pair by Theorem 5);
                     set Rec := Rec ∪ {(i, a_i, b_i)} and
                     Wit := Wit ∪ {a, (a_1, ..., a_{i-1}, b_i, b_{i+1}, ..., b_n)}
             endwhile
Step 3.      return Pro, Rec, Wit

Fig. 1. Algorithm learning signatures with equivalence queries







4.3 Signatures Are Polynomially Learnable with Equivalence 
Queries 

Making use of the algorithm presented in Theorem 6, it is not difficult to find
an algorithm that learns signatures with equivalence queries. The algorithm is
presented in Figure 1.

The target concept of the algorithm is an $n$-ary relation $R$ invariant under $f$.
Recall that $f$, $k$ and $A$ are fixed. At any stage of the execution of the algorithm
the relation represented by (Pro, Rec, Wit) is a subset of $\langle \mathrm{Wit} \rangle \subseteq R$. This ensures
that the algorithm is monotonic. Furthermore, at every round of the algorithm,
either Pro or Rec increases its size. Since Pro and Rec can have at most a
polynomial number of elements, we can ensure that the algorithm ends in a number
of steps polynomial in $n$.

Theorem 7. The class Sig is polynomially learnable with equivalence queries.

Corollary 3. Let $A$ be a finite set and let $\Gamma$ be a finite set of relations on $A$.

— If $\Gamma$ is invariant under a GMM operation then $\forall\exists$-Form($\Gamma$) is polynomially
learnable with equivalence queries in Sig.

— If $\Gamma$ is invariant under a conservative GMM operation then $\forall\exists_r$-Form($\Gamma$) is
polynomially learnable with equivalence queries in Sig.

Abstract. The Elementary Formal Systems (EFSs, for short) were originally
introduced by Smullyan to develop his recursion theory. In a word,
EFSs are a kind of logic programs which use strings instead of terms
in first order logic, and they are shown to be natural devices to define
languages.

In defining languages of EFSs, many studies assume that substitutions 
are nonerasing. A substitution or a nonerasing substitution is defined as 
a homomorphism from nonempty strings to nonempty strings that maps 
each constant symbol to itself, although an erasing substitution allows 
mapping of variables to the empty string. 

In this paper, we investigate the learnability of language classes generated
by simple formal systems (SFSs, for short) as well as regular formal
systems (RFSs, for short). We show that the learnability of some classes
of their languages varies according to whether or not erasing substitu- 
tions are allowed. Then we present a subclass of RFSs such that the 
corresponding class of languages is inferable in the limit from positive 
examples, even when erasing substitutions are allowed. 

We also apply the obtained result to the learning problem of languages 
generated by simple H systems. The simple H systems (SH systems, for
short) were introduced by Mateescu et al. by modeling the recombinant
behavior of DNA sequences. We show that an SH system can be natu- 
rally converted into an RFS whose language coincides with the language 
generated by the original SH system. By using this conversion, we show 
that the class of languages generated by a certain subclass of SH systems 
is inferable in the limit from positive examples. 



1 Introduction 

The Elementary Formal Systems (EFSs, for short) were originally introduced by
Smullyan [23] to develop his recursion theory. In a word, EFSs are a kind of
logic programs which use strings instead of terms in first order logic (Yamamoto
[25]), and they are shown to be natural devices to define languages (Arikawa
[3] and Arikawa et al. [4]). In particular, Arikawa et al. [4] presented various
subclasses of EFSs whose language classes correspond to the four classes of the
Chomsky hierarchy.

By taking advantage of structural merits and flexibility of logic programs, 
many researchers discussed learning problems in the framework of EFSs (Arikawa 
et al. [4,5], Sakamoto et al. [18], Lange et al. [12] and so on). Especially, in the 
learning paradigm of identification in the limit due to Gold [8], Shinohara [21] 
showed that the language class generated by the so-called length-bounded EFSs 
with at most $k$ axioms is inferable in the limit from positive examples for each
$k > 0$. Angluin [1] introduced a pattern and its language and showed that the
class of pattern languages is inferable in the limit from positive examples. The 
class of pattern languages is regarded as the class of languages generated by 
EFSs with just one axiom. 

In defining models and languages of EFSs, many studies assume that sub- 
stitutions are nonerasing. A substitution or a nonerasing substitution is defined 
as a homomorphism from nonempty strings to nonempty strings that maps each 
constant symbol to itself, although an erasing substitution allows mapping of 
variables to the empty string. However, in applying patterns or EFSs to real 
problems, there are many cases that erasing substitutions are more suitable 
(Arimura et al. [6], Shinohara [19] and so on). We call a language generated
by an EFS using nonerasing substitutions (resp., erasing substitutions) an EFS
language (resp., an erasing EFS language). Then an erasing EFS language
generated by an EFS with just one axiom is called an extended pattern language. It
is unknown whether or not the class of extended pattern languages is inferable
in the limit from positive examples, although Reidenbach [17] showed that that
class with just two constant symbols is not inferable in the limit from positive
examples. Concerning erasing EFS languages, Uemura and Sato [24] showed that
the class of the so-called erasing PFS languages is inferable in the limit from
positive examples under a certain condition.

In this paper, we investigate the learnability of language classes generated by
simple formal systems (SFSs, for short) as well as regular formal systems (RFSs,
for short). We show that the learnability of their languages varies according to
whether or not erasing substitutions are allowed. That is, we show that the class
of EFS languages generated by SFSs or RFSs with at most $k$ axioms is inferable
in the limit from positive examples, but the corresponding class of
erasing EFS languages is not.

We also apply the obtained result to the learning problem of languages
generated by simple H systems. Simple H systems (SH systems, for short) were
introduced by Mateescu et al. [14] by modeling the recombinant behavior of DNA
sequences. We show that an SH system can be naturally converted into an RFS
whose erasing EFS language coincides with the language generated by the
original SH system. By using this conversion, we show that the class of languages
generated by SH systems with at most $k$ segments is inferable in the limit from
positive examples for each $k > 0$. Head [10] showed that the class of SH
languages becomes a subclass of strictly locally testable languages. The obtained
result presents a learnable subclass of strictly locally testable languages
(Yokomori et al. [26], Garcia [7] and so on).

2 Preliminaries 

In this section, we summarize basic notations and results necessary for our dis- 
cussion. 

Let $\Sigma$ be a fixed finite alphabet. Each element of $\Sigma$ is called a constant
symbol. Let $\Sigma^*$ be the set of constant strings over $\Sigma$ and let $\Sigma^+$ be that of
nonempty constant strings over $\Sigma$. Let us denote by $\varepsilon$ the empty string, i.e., the
string with length 0. Then $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$ holds. A subset $L$ of $\Sigma^*$ is called a
language over $\Sigma$. For a string $w \in \Sigma^*$, we denote by $|w|$ the length of $w$. For two
strings $u, v \in \Sigma^*$, we denote by $uv$ or $u \cdot v$ the concatenation of $u$ and $v$.

In this paper, $\sharp S$ denotes the cardinality of a set $S$, and $N$ denotes the set
of nonnegative integers.

Definition 1 (Angluin [2]). A class $\mathcal{L} = \{L_i\}_{i \in N}$ of languages is said to be
an indexed family of recursive languages, if there is a total computable function
$f : N \times \Sigma^* \to \{0, 1\}$ such that $f(i, w) = 1$ if and only if $w \in L_i$.

Since we exclusively deal with an indexed family of recursive languages as a 
class of languages, we simply call it a class of languages without any notice. 

Definition 2 (Gold [8]). A positive presentation, or a text, of a language
$L \subseteq \Sigma^*$ is an infinite sequence $w_0, w_1, \dots \in \Sigma^*$ such that $\{w_i \mid i \ge 0\} = L$.

In what follows, $\sigma$ or $\delta$ denotes a positive presentation, and $\sigma[n]$ denotes the
initial segment of $\sigma$ of length $n \ge 0$.

An inductive inference machine (IIM, for short) is an effective procedure, or 
a certain type of Turing machine, which requests inputs from time to time and 
produces indices from time to time. The outputs produced by the machine are 
called guesses. 

For an IIM $M$, a positive presentation $\sigma$ and $n \ge 0$, we denote by $M(\sigma[n])$
the last guess produced by $M$ when it is successively presented the examples in
$\sigma[n]$ on its input requests.

An IIM $M$ is said to converge to an index $i$ for a positive presentation $\sigma$, if
there is an $n \ge 0$ such that for every $m \ge n$, $M(\sigma[m]) = i$.

Then we define the learnability of a class of languages as follows: 

Definition 3 (Gold [8] and Angluin [2]). Let $\mathcal{L} = \{L_i\}_{i \in N}$ be a class of
languages.

An IIM $M$ is said to infer a language $L_i \in \mathcal{L}$ in the limit from positive
examples, if for every positive presentation $\sigma$ of $L_i$, $M$ converges to an index $j$
for $\sigma$ such that $L_j = L_i$.

An IIM $M$ is said to infer a class $\mathcal{L}$ in the limit from positive examples, if
for every $L_i \in \mathcal{L}$, $M$ infers $L_i$ in the limit from positive examples.

A class $\mathcal{L}$ is said to be inferable in the limit from positive examples, if there
is an IIM which infers $\mathcal{L}$ in the limit from positive examples.
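
As a toy illustration of this definition (not taken from the paper's constructions), consider the class of finite languages $L_n = \{a^i \mid 0 \le i \le n\}$ indexed by $n$. The sketch below assumes strings over the single symbol a are given as Python strings and uses a hypothetical function `learner` as the IIM: after each example it guesses the index of the smallest $L_n$ consistent with what it has seen.

```python
def learner(segment):
    """Guess for an initial segment of a text: the index n of the smallest
    language L_n = {a^i | 0 <= i <= n} containing every example seen so far."""
    return max((len(w) for w in segment), default=0)

# A finite prefix of some text for L_3; a real text is an infinite sequence.
text = ["a", "", "aaa", "aa", "aaa", "a"]
guesses = [learner(text[:n]) for n in range(1, len(text) + 1)]
```

Once the longest string of the target language has appeared, the guess never changes again, which is exactly the convergence required above. The strategy would fail if the infinite language $\{a^i \mid i \ge 0\}$ were added to the class.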

Angluin [2] presented a theorem characterizing when a class is inferable in
the limit from positive examples.

Definition 4 (Angluin [2]). A set $T$ of strings is a finite tell-tale set of a
language $L$ within a class $\mathcal{L}$, if (1) $T$ is a finite subset of $L$ and (2) there is no
$L_i \in \mathcal{L}$ such that $T \subseteq L_i \subsetneq L$.



Theorem 1 (Angluin [2]). A class $\mathcal{L}$ is inferable in the limit from positive
examples, if and only if there is an effective procedure which on input $i \ge 0$
enumerates a finite tell-tale set of $L_i \in \mathcal{L}$ within $\mathcal{L}$.

We note that if a class $\mathcal{L}$ with some specific indexing is inferable in the limit
from positive examples, so is the class with arbitrary indexing (cf., e.g., Lange
and Zeugmann [11]). Additionally, we discuss inferability of each class using
set-theoretic properties of the class which are independent of indexing. Thus, in the
present paper, we fix some suitable indexing for each class and do not pay much
attention to the indexing of the class.

By the theorem above, we see that the following lemma is valid: 

Lemma 1. Let $\mathcal{L}$ be a class of languages.

Then if there exist a sequence $L_{i_0}, L_{i_1}, \dots \in \mathcal{L}$ and a language $L_i \in \mathcal{L}$ such
that

$$L_{i_0} \subsetneq L_{i_1} \subsetneq \cdots \quad \text{and} \quad \bigcup_{n \ge 0} L_{i_n} = L_i,$$

then $\mathcal{L}$ is not inferable in the limit from positive examples.

A class is called super-finite, if it contains every finite language and at least
one infinite language. We note that each super-finite class satisfies the premise in
the above lemma, and thus it is not inferable in the limit from positive examples
(Gold [8]).

Wright [27] and Motoki et al. [15] showed a sufficient condition for a class to 
be inferable in the limit from positive examples. 

Definition 5 (Wright [27] and Motoki et al. [15]). A class $\mathcal{L}$ is said to
have infinite elasticity, if there are two infinite sequences $L_{i_1}, L_{i_2}, \dots \in \mathcal{L}$ and
$w_0, w_1, \dots \in \Sigma^*$ such that for every $n \ge 1$,

$$\{w_0, w_1, \dots, w_{n-1}\} \subseteq L_{i_n} \quad \text{but} \quad w_n \notin L_{i_n}.$$

A class $\mathcal{L}$ is said to have finite elasticity, if $\mathcal{L}$ does not have infinite elasticity.

By definition, it is easy to see that if a class $\mathcal{L}$ has finite elasticity, then so
does every subclass of $\mathcal{L}$.

Theorem 2 (Wright [27] and Motoki et al. [15]). A class $\mathcal{L}$ is inferable in
the limit from positive examples, if $\mathcal{L}$ has finite elasticity.

3 Elementary Formal Systems and Their Languages

3.1 Definitions and Basic Properties

In this section, we briefly recall EFSs and their languages. For detailed definitions
and properties of EFSs, please refer to Smullyan [23], Arikawa [3], Arikawa et
al. [4] and Yamamoto [25].

Let $\Sigma$ be a finite alphabet. Let $X$ and $\Pi$ be nonempty sets. Elements in
$X$ and $\Pi$ are called variables and predicate symbols, respectively. We assume
that $\Sigma$, $X$ and $\Pi$ are mutually disjoint. By $x, y, x_1, x_2, \dots$, we denote variables,
and by $p, q, p_1, p_2, \dots$, we denote predicate symbols. Each predicate symbol is
associated with a positive integer which we call an arity.

Definition 6. A term or a pattern is a possibly empty string over $\Sigma \cup X$. By
$\pi, \pi_1, \pi_2, \dots$, we denote terms. A term $\pi$ is said to be ground, if $\pi$ does not
contain any variable. By $w, w_1, w_2, \dots$, we denote ground terms.

An atomic formula (atom, for short) is an expression of the form
$p(\pi_1, \dots, \pi_n)$, where $n \ge 1$, $p \in \Pi$ is a predicate symbol with arity $n$,
and $\pi_1, \dots, \pi_n$ are terms. By $A, B, A_1, A_2, \dots$, we denote atoms. An atom
$p(\pi_1, \dots, \pi_n)$ is said to be ground, if $\pi_1, \dots, \pi_n$ are ground terms. We define
the length of an atom $p(\pi_1, \dots, \pi_n)$, denoted by $|p(\pi_1, \dots, \pi_n)|$, as $\sum_{1 \le i \le n} |\pi_i|$.

We define well-formed formulas and clauses in the ordinary ways [13].

Definition 7. A definite clause is a clause of the form

$$A \leftarrow B_1, \dots, B_n,$$

where $n \ge 0$, and $A, B_1, \dots, B_n$ are atoms. In case $A, B_1, \dots, B_n$ are all ground
atoms, we call the above clause a ground clause. The atom $A$ above is
called the head of the clause, and the sequence $B_1, \dots, B_n$ is called the body of
the clause. By $C, D, C_1, C_2, \dots$, we denote definite clauses. Then an EFS is a
finite set of definite clauses, each of which is called an axiom.

A substitution is a homomorphism from terms to terms which maps each
symbol $a \in \Sigma$ to itself and each variable $x \in X$ to a nonempty term in $(\Sigma \cup X)^+$.
In case we allow mapping of a variable to the empty string $\varepsilon$, we call such
substitutions erasing substitutions. By $\theta, \delta, \theta_1, \theta_2, \dots$, we denote substitutions.

For a term $\pi$, the image of $\pi$ by a substitution $\theta$ is denoted by $\pi\theta$. For a
predicate symbol $p \in \Pi$ with arity $n$ and for terms $\pi_1, \dots, \pi_n$, the image of an
atom $A = p(\pi_1, \dots, \pi_n)$ by a substitution $\theta$ is defined as $A\theta = p(\pi_1, \dots, \pi_n)\theta =
p(\pi_1\theta, \dots, \pi_n\theta)$. Furthermore, for atoms $A, B_1, \dots, B_n$, the image of a clause $C =
A \leftarrow B_1, \dots, B_n$ by a substitution $\theta$ is defined as $C\theta = (A \leftarrow B_1, \dots, B_n)\theta =
A\theta \leftarrow B_1\theta, \dots, B_n\theta$.
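
The difference between nonerasing and erasing substitutions can be made concrete with a small sketch. The representation is an assumption made only for illustration: a term is a Python list of tokens, a token beginning with "x" is a variable, every other token is a constant symbol, and the function names `is_var` and `apply` are hypothetical.

```python
def is_var(tok):
    """Assumed convention of this sketch: variables are tokens starting with 'x'."""
    return tok.startswith("x")

def apply(term, theta, erasing=False):
    """Image of `term` under the substitution `theta` (a dict from variables
    to terms). Constants and unmapped variables are sent to themselves.
    Mapping a variable to the empty term is allowed only if erasing=True."""
    out = []
    for tok in term:
        if is_var(tok) and tok in theta:
            image = list(theta[tok])
            if not image and not erasing:
                raise ValueError(f"nonerasing substitution cannot erase {tok}")
            out.extend(image)
        else:
            out.append(tok)
    return out

term = ["a", "x1", "b", "x2"]
theta = {"x1": ["a", "a"], "x2": []}
image = apply(term, theta, erasing=True)
```

With `erasing=False` the same call raises an error, since `x2` would be mapped to the empty string.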

Then for an EFS, we define its language as follows:

Definition 8. Let $\Gamma$ be an EFS. We define the relation $\Gamma \vdash C$ for a clause $C$
inductively as follows:

(1) If $C \in \Gamma$, then $\Gamma \vdash C$.

(2) If $\Gamma \vdash C$, then $\Gamma \vdash C\theta$ for every substitution $\theta$.

(3) If $\Gamma \vdash A \leftarrow B_1, \dots, B_{n+1}$ and $\Gamma \vdash B_{n+1} \leftarrow$, then $\Gamma \vdash A \leftarrow B_1, \dots, B_n$.

We also define the relation $\Gamma \vdash_\varepsilon C$ similarly but allowing erasing substitutions
in (2). Therefore $\Gamma \vdash C$ implies $\Gamma \vdash_\varepsilon C$, but the converse is not always true.

For an EFS $\Gamma$ and for a unary predicate symbol $p \in \Pi$, we define the
languages $L(\Gamma, p)$ and $L_\varepsilon(\Gamma, p)$ as follows:

$$L(\Gamma, p) = \{w \in \Sigma^* \mid \Gamma \vdash p(w) \leftarrow\} \quad \text{and} \quad L_\varepsilon(\Gamma, p) = \{w \in \Sigma^* \mid \Gamma \vdash_\varepsilon p(w) \leftarrow\}.$$

A language $L$ is called an EFS language (resp., an erasing EFS language),
if there is an EFS $\Gamma$ and a unary predicate symbol $p \in \Pi$ such that $L = L(\Gamma, p)$
(resp., $L = L_\varepsilon(\Gamma, p)$).

Arikawa et al. [4] introduced subclasses of EFSs whose language classes
correspond to the four classes of the Chomsky hierarchy. In the present paper, we
concentrate our attention on the subclasses of EFSs called simple formal systems
(Arikawa [3]) and regular formal systems (Shinohara [20]) defined as follows:

Definition 9. An EFS $\Gamma$ is called a simple formal system (an SFS, for short),
if every axiom of $\Gamma$ is of the form

$$p(\pi) \leftarrow q_1(x_1), \dots, q_n(x_n),$$

where $n \ge 0$, $p, q_1, \dots, q_n \in \Pi$ are unary predicate symbols, $\pi$ is a term, and
$x_1, \dots, x_n$ are mutually distinct variables appearing in $\pi$.

A term $\pi$ is called regular, if every variable appears at most once in $\pi$. An
SFS $\Gamma$ is called a regular formal system (an RFS, for short), if every term
appearing in $\Gamma$ is regular.

A language $L$ is called an SFS language (resp., an erasing SFS language), if
$L$ is an EFS language (resp., an erasing EFS language) generated by an SFS.
We also define an RFS language and an erasing RFS language similarly.

We denote by $\mathcal{SFS}$ the class of SFSs, by $\mathcal{SFSL}$ that of SFS languages, and by
$\varepsilon\mathcal{SFSL}$ that of erasing SFS languages. We also define the classes $\mathcal{RFS}$, $\mathcal{RFSL}$
and $\varepsilon\mathcal{RFSL}$ similarly.

Arikawa [3] and Arikawa et al. [4] showed the powers of SFSs and RFSs:

Theorem 3 (Arikawa [3] and Arikawa et al. [4]). The following relations
hold:

$$\mathcal{CFL} = \mathcal{RFSL} \subseteq \mathcal{SFSL} \subseteq \mathcal{CSL},$$

where $\mathcal{CFL}$ and $\mathcal{CSL}$ represent the class of context-free languages and that of
context-sensitive languages, respectively.

Furthermore, on erasing SFS or RFS languages, we have the following theorem:

Theorem 4. The following relations hold:

(1) $\varepsilon\mathcal{SFSL} = \mathcal{SFSL}$ holds.

(2) $\varepsilon\mathcal{RFSL} = \mathcal{RFSL}$ holds.

Proof. (I) Let $L \in \varepsilon\mathcal{SFSL}$. Then there is an SFS $\Gamma$ and a unary predicate
symbol $p \in \Pi$ such that $L = L_\varepsilon(\Gamma, p)$.

For a clause $C$, let us denote by $\mathrm{var}(C)$ the set of all variables appearing in $C$. For
a set $Y$ of variables, we put $\theta_\varepsilon(Y) = \{x := \varepsilon \mid x \in Y\}$, that is, $\theta_\varepsilon(Y)$ is the
erasing substitution that maps every variable $x \in Y$ to the empty string $\varepsilon$.

Then let us put $\Gamma' = \{C\theta_\varepsilon(Y) \mid C \in \Gamma,\ Y \subseteq \mathrm{var}(C)\}$. We note that $\Gamma'$ is an
EFS but may not be an SFS, because $\Gamma'$ may contain a clause whose body has
an atom of the form $q(\varepsilon)$.

Let $\Gamma'' = \Gamma'$. Then we delete from $\Gamma''$ every clause $C$ whose body contains
an atom $q(\varepsilon)$ with $\Gamma' \nvdash q(\varepsilon) \leftarrow$. Finally, we delete all atoms of the form $q(\varepsilon)$
appearing in the bodies of the clauses in $\Gamma''$. Then $\Gamma''$ is an SFS such that
$L = L_\varepsilon(\Gamma, p) = L(\Gamma'', p)$, and thus $L \in \mathcal{SFSL}$ holds.

This means that $\varepsilon\mathcal{SFSL} \subseteq \mathcal{SFSL}$.

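(II) Let $L \in \mathcal{SFSL}$. Then there is an SFS $\Gamma$ and a unary predicate symbol
$p \in \Pi$ such that $L = L(\Gamma, p)$.

First, we put

$$\Gamma_0 = \{\, C' \mid C' \text{ is obtained from some } C \in \Gamma \text{ by replacing every variable } x \text{ that appears only in the head of } C \text{ with } ax \text{ for some } a \in \Sigma \,\}.$$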

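Let $p_1, \dots, p_n$ be all the predicate symbols such that $\Gamma \vdash p_i(\varepsilon) \leftarrow$. Then we
define $\Gamma_i$'s ($1 \le i \le n$) as follows:

$$\Gamma_i = \{\, C' \mid C' \text{ is obtained from some } C \in \Gamma_{i-1} \text{ by deleting the atom } p_i(x) \text{ in the body of } C \text{ and deleting every appearance of variable } x \text{ in the head of } C, \text{ where } x \text{ is the variable appearing in the body associated with } p_i \,\}.$$

Then $\Gamma_n$ is an SFS such that $L = L(\Gamma, p) = L_\varepsilon(\Gamma_n, p)$, and thus $L \in \varepsilon\mathcal{SFSL}$
holds.

This means that $\mathcal{SFSL} \subseteq \varepsilon\mathcal{SFSL}$.

By (I) and (II), we conclude that $\mathcal{SFSL} = \varepsilon\mathcal{SFSL}$ is valid. The above proof
is also a proof for $\mathcal{RFSL} = \varepsilon\mathcal{RFSL}$. □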

3.2 Learnability of EFS Languages 

In the present paper, we consider learnability of languages generated by SFSs, 
RFSs and SH systems. Since these systems are recursively enumerable and their 
languages are recursive, we can regard these language classes as indexed families 
of recursive languages. 

As easily seen, the classes $\mathcal{SFSL}$, $\mathcal{RFSL}$, $\varepsilon\mathcal{SFSL}$ and $\varepsilon\mathcal{RFSL}$ are
super-finite, and thus we have the following theorem:

Theorem 5. The classes $\mathcal{SFSL}$, $\mathcal{RFSL}$, $\varepsilon\mathcal{SFSL}$ and $\varepsilon\mathcal{RFSL}$ are not inferable
in the limit from positive examples.

For each $k \ge 1$, we define the subclass $\mathcal{SFS}^{\le k}$ of $\mathcal{SFS}$ as the class of SFSs
with at most $k$ axioms. Then we denote by $\mathcal{SFSL}^{\le k}$ the class of SFS languages
generated by SFSs in $\mathcal{SFS}^{\le k}$ and by $\varepsilon\mathcal{SFSL}^{\le k}$ that of erasing SFS languages
generated by SFSs in $\mathcal{SFS}^{\le k}$. We also define the classes $\mathcal{RFS}^{\le k}$, $\mathcal{RFSL}^{\le k}$ and
$\varepsilon\mathcal{RFSL}^{\le k}$ similarly.

Shinohara [21,22] showed that for each $k \ge 1$, the class of languages generated
by the so-called length-bounded EFSs with at most $k$ axioms has finite elasticity,
and thus it is inferable in the limit from positive examples. Since $\mathcal{SFSL}^{\le k}$ and
$\mathcal{RFSL}^{\le k}$ are subclasses of the class, we see that the following theorem holds:

Theorem 6. Let $k \ge 1$.

Then the classes $\mathcal{SFSL}^{\le k}$ and $\mathcal{RFSL}^{\le k}$ have finite elasticity, and thus they
are inferable in the limit from positive examples.

In the rest of this section, we consider learnability of the classes $\varepsilon\mathcal{SFSL}^{\le k}$
and $\varepsilon\mathcal{RFSL}^{\le k}$.

Theorem 7. Let $\sharp\Sigma \ge 1$ and let $k \ge 3$.

Then the classes $\varepsilon\mathcal{RFSL}^{\le k}$ and $\varepsilon\mathcal{SFSL}^{\le k}$ are not inferable in the limit from
positive examples.

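Proof. We consider the following RFSs:

$$\Gamma_n = \left\{ \begin{array}{l} p(x_1 \cdots x_n) \leftarrow q(x_1), \dots, q(x_n); \\ q(a) \leftarrow; \\ q(\varepsilon) \leftarrow \end{array} \right\} \quad (n \ge 0), \qquad \Gamma_\infty = \left\{ \begin{array}{l} p(a x_1) \leftarrow p(x_1); \\ p(\varepsilon) \leftarrow \end{array} \right\}.$$

Then, as easily seen,

$$L_\varepsilon(\Gamma_n, p) = \{a^i \mid 0 \le i \le n\} \quad (n \ge 0), \qquad L_\varepsilon(\Gamma_\infty, p) = \{a^i \mid i \ge 0\}$$

hold, and thus

$$L_\varepsilon(\Gamma_0, p) \subsetneq L_\varepsilon(\Gamma_1, p) \subsetneq \cdots \quad \text{and} \quad \bigcup_{n \ge 0} L_\varepsilon(\Gamma_n, p) = L_\varepsilon(\Gamma_\infty, p)$$

hold. Therefore, by Lemma 1, the class $\varepsilon\mathcal{RFSL}^{\le k}$ is not inferable in the limit
from positive examples.

Since the above EFSs are also SFSs, we see that the class $\varepsilon\mathcal{SFSL}^{\le k}$ is also
not inferable in the limit from positive examples. □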
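
The chain used in this proof can be checked mechanically for small $n$. The sketch below is an illustration only: it assumes strings over the constant a are Python strings, and the function name `lang_gamma` is hypothetical. It enumerates $L_\varepsilon(\Gamma_n, p)$ directly from the axioms, using the fact that under erasing substitutions each variable $x_i$ in the head $p(x_1 \cdots x_n)$ must be instantiated by a string derivable from $q$, that is, by a or by the empty string.

```python
from itertools import product

def lang_gamma(n):
    """L_eps(Gamma_n, p): every x_i in p(x_1...x_n) is replaced by a string
    derivable from q, which is either 'a' or the empty string."""
    return {"".join(choice) for choice in product(["a", ""], repeat=n)}

# Python's `<` on sets tests for a proper subset.
chain_is_strict = all(lang_gamma(n) < lang_gamma(n + 1) for n in range(5))
```

Every string $a^i$ appears in some member of the chain, so the union of the chain is the language of $\Gamma_\infty$, which is the situation excluded by Lemma 1.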

We note that Shinohara [19] showed that the class of the so-called extended
regular pattern languages, which coincides with $\varepsilon\mathcal{RFSL}^{\le 1}$, is inferable in the
limit from positive examples. On the other hand, Uemura and Sato [24] showed
that the class of the so-called erasing primitive formal languages, which is a
subclass of $\varepsilon\mathcal{RFSL}^{\le 2}$, is inferable in the limit from positive examples under a
certain condition.

Furthermore, the following example shows that the class of infinite languages
in $\varepsilon\mathcal{RFSL}^{\le k}$ with $k \ge 5$ and the class $\varepsilon\mathcal{SFSL}^{\le k}$ with $k \ge 2$ are also not inferable
in the limit from positive examples.

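Example 1. (1) Let $k \ge 5$, let $\sharp\Sigma \ge 2$, and let $a, b \in \Sigma$ be two distinct constant
symbols.

We consider the following RFSs:

$$\Gamma_n = \left\{ \begin{array}{l} r(x_1 x_2 b) \leftarrow r(x_1), p(x_2); \\ r(\varepsilon) \leftarrow; \\ p(x_1 \cdots x_n) \leftarrow q(x_1), \dots, q(x_n); \\ q(a) \leftarrow; \\ q(\varepsilon) \leftarrow \end{array} \right\} \quad (n \ge 0), \qquad \Gamma_\infty = \left\{ \begin{array}{l} r(x_1 x_2 b) \leftarrow r(x_1), p(x_2); \\ r(\varepsilon) \leftarrow; \\ p(a x_1) \leftarrow p(x_1); \\ p(\varepsilon) \leftarrow \end{array} \right\}.$$

Then, as easily seen,

$$L_\varepsilon(\Gamma_n, r) = \{a^{i_1} b\, a^{i_2} b \cdots a^{i_m} b \mid m \ge 0,\ 0 \le i_1, i_2, \dots, i_m \le n\} \quad (n \ge 0),$$

$$L_\varepsilon(\Gamma_\infty, r) = \{a^{i_1} b\, a^{i_2} b \cdots a^{i_m} b \mid m \ge 0,\ i_1, i_2, \dots, i_m \ge 0\}$$

hold, and thus by Lemma 1, the class of infinite languages in $\varepsilon\mathcal{RFSL}^{\le k}$ is not
inferable in the limit from positive examples.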

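(2) Let $k \ge 2$ and let $\sharp\Sigma \ge 2$.

We consider the following SFSs:

$$\Gamma_n = \{\, p(x_1 x_2 \cdots x_n x_n \cdots x_2 x_1) \leftarrow \,\} \quad (n \ge 0), \qquad \Gamma_\infty = \left\{ \begin{array}{l} p(x_1 x_2 x_1) \leftarrow p(x_2); \\ p(x_1 x_1) \leftarrow \end{array} \right\}.$$

For each $n \ge 0$,

$$p(x_1 x_2 \cdots x_n x_{n+1} x_{n+1} x_n \cdots x_2 x_1)\{x_{n+1} := \varepsilon\} = p(x_1 x_2 \cdots x_n x_n \cdots x_2 x_1)$$

holds, and thus we have $L_\varepsilon(\Gamma_n, p) \subseteq L_\varepsilon(\Gamma_{n+1}, p)$.

Let $a, b \in \Sigma$ be two distinct constant symbols. For each $n \ge 0$, we put
$w_n = ab^n$ and $u_n = w_0 w_1 \cdots w_n w_n \cdots w_1 w_0$. Then, as easily seen, for each
$n \ge 0$, $u_n \in L_\varepsilon(\Gamma_{n+1}, p) \setminus L_\varepsilon(\Gamma_n, p)$ holds. Thus it is easy to see that these
languages satisfy the conditions of Lemma 1. Therefore the class $\varepsilon\mathcal{SFSL}^{\le k}$ is
not inferable in the limit from positive examples.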

Now, we introduce a subclass of $e\mathcal{RFSL}^{\le k}$ that is inferable in the limit from
positive examples.

Definition 10. A regular term $\pi$ is called canonical, if $\pi = w_0 x_1 w_1 x_2 \cdots x_n w_n$
for some $n \ge 0$, $w_0, w_n \in \Sigma^*$ and $w_1, \ldots, w_{n-1} \in \Sigma^+$.

An RFS $\Gamma$ is called canonical, if all regular terms appearing in $\Gamma$ are canonical.
We denote by $\mathcal{CRFS}$ the class of canonical RFSs, by $\mathcal{CRFSL}$ that of canonical
RFS languages, and by $e\mathcal{CRFSL}$ that of erasing canonical RFS languages.
Furthermore, for each k > 0, we define $\mathcal{CRFS}^{\le k}$, $\mathcal{CRFSL}^{\le k}$ and $e\mathcal{CRFSL}^{\le k}$
similarly to $\mathcal{RFS}^{\le k}$, $\mathcal{RFSL}^{\le k}$ and $e\mathcal{RFSL}^{\le k}$.

Shinohara [19] showed that the so-called extended pattern language definable 
by an arbitrary regular pattern is also definable by a canonical regular pattern. 

Lemma 2. Let $\pi$ be a term and let $\tau$ be a canonical regular term.

Then if $\pi = \tau\theta$ for some erasing substitution $\theta$, then $|\tau| \le 2|c(\pi)| + 1$ holds,
where $c(\pi)$ denotes the constant string obtained from $\pi$ by deleting all variables
in $\pi$.

Proof. Assume $\pi = \tau\theta$ for some erasing substitution $\theta$. Then, as easily seen,
$c(\tau)$ is a subsequence of $c(\pi)$, and thus $|c(\tau)| \le |c(\pi)|$ holds. Let us put
$c(\tau) = c_1 c_2 \cdots c_m$, where $m = |c(\tau)|$ and $c_1, c_2, \ldots, c_m \in \Sigma$. Since $\tau$ is a canonical
regular term, $\tau$ must be of the form $\alpha_1 c_1 \alpha_2 c_2 \cdots c_m \alpha_{m+1}$, where $\alpha_i$ is either a
variable or the empty string ($1 \le i \le m+1$). Therefore we have
$|\tau| \le 2m + 1 = 2|c(\tau)| + 1 \le 2|c(\pi)| + 1$. □



Definition 11. Let $\Gamma$ be an EFS and let $C$ be a clause.

An axiom $D$ of $\Gamma$ is said to be used in proving $C$, if $\Gamma \vdash_e C$ but $\Gamma \setminus \{D\} \nvdash_e C$.

Lemma 3. Let $p(w)$ be a ground atom and let $\Gamma$ be a canonical RFS such that
$\Gamma \vdash_e p(w) \leftarrow$.

Then the length of the head of each axiom of $\Gamma$ used in proving $p(w) \leftarrow$ is
not greater than $2|w| + 1$.

Proof. A clause $C$ is said to be a ground instance of a clause $D$, if $C$ is a ground
clause such that $C = D\theta$ for some erasing substitution $\theta$.

First, we define the relation $\Gamma \vdash'_e C$ for a clause $C$ as follows:

(1) If $C$ is a ground instance of an axiom of $\Gamma$, then $\Gamma \vdash'_e C$.

(2) If $\Gamma \vdash'_e A \leftarrow B_1, \ldots, B_{n+1}$ and $\Gamma \vdash'_e B_{n+1} \leftarrow$, then $\Gamma \vdash'_e A \leftarrow B_1, \ldots, B_n$.

In case $C$ is a ground clause, we can show that $\Gamma \vdash'_e C$ if and only if $\Gamma \vdash_e C$.
Since $\Gamma$ is an SFS, for every ground instance $C$ of an axiom of $\Gamma$, the length
of the head of $C$ is greater than or equal to the length of each atom in the body
of $C$.

Assume that $\Gamma \vdash'_e p(w) \leftarrow$ is witnessed by a sequence $\Gamma \vdash'_e C_1, \ldots, \Gamma \vdash'_e C_n$.
Without loss of generality, we can assume that every clause $C_i$ with $1 \le i < n$
is used by (2) in deriving $C_j$ for some $j$ with $i < j \le n$. (Otherwise, we can
eliminate $\Gamma \vdash'_e C_i$.)

Then, as easily seen, the length of every head of $C_i$ with $1 \le i \le n$ is not
greater than the length of $p(w)$. Therefore, by Lemma 2, the length of the head of
every axiom of $\Gamma$ used by (1) is not greater than $2|w| + 1$.

This means that the length of the head of each axiom of $\Gamma$ used in proving
$p(w) \leftarrow$ is not greater than $2|w| + 1$. □

By this lemma, in a similar way to Shinohara [21,22], we can show the fol- 
lowing theorem: 

Theorem 8. Let k > 0. 

Then the class $e\mathcal{CRFSL}^{\le k}$ has finite elasticity, and thus it is inferable in the
limit from positive examples.

4 Simple H Systems 

4.1 Definitions and Basic Properties 

According to Mateescu et al. [14], we define simple H systems¹ and their
languages.

For three strings $x, y, z \in \Sigma^*$ and for a constant symbol $a \in \Sigma$, we define the
relation $(x, y) \vdash^a z$ as follows:

\[ (x, y) \vdash^a z \iff \exists x_1, x_2, y_1, y_2 \in \Sigma^* \text{ s.t. } x = x_1 a x_2,\ y = y_1 a y_2,\ z = x_1 a y_2. \]

In this relation, we call the symbol $a$ a marker.

For sets $M \subseteq \Sigma$ and $S \subseteq \Sigma^*$, we define the sets $\sigma_M(S), \sigma_M^i(S) \subseteq \Sigma^*$
($i \in N \cup \{*\}$) as follows:

\[ \sigma_M(S) = S \cup \{ z \in \Sigma^* \mid \exists x, y \in S,\ \exists a \in M \text{ s.t. } (x, y) \vdash^a z \}, \]

\[ \sigma_M^0(S) = S, \]

\[ \sigma_M^{i+1}(S) = \sigma_M(\sigma_M^i(S)) \quad (i \ge 0), \]

\[ \sigma_M^*(S) = \bigcup_{i \ge 0} \sigma_M^i(S). \]

Definition 12 (Mateescu et al. [14]). A simple H system is a triple
$G = (\Sigma, M, A)$, where $\Sigma$ is a finite alphabet, $M \subseteq \Sigma$, and $A \subseteq \Sigma^*$ is a finite set. The
elements of $M$ are called markers and those of $A$ are called axioms.

For a simple H system $G = (\Sigma, M, A)$, the language $L(G)$ generated by $G$ is
defined by

\[ L(G) = \sigma_M^*(A). \]

Hereafter, we call a simple H system an SH system simply, and by $\mathcal{SHS}$,
we denote the class of SH systems. Furthermore, a language $L$ is called an SH
language, if there is an SH system $G$ such that $L = L(G)$. By $\mathcal{SHL}$, we denote
the class of SH languages.

¹ Mateescu et al. [14] called the splicing systems H systems in honor of the originator,
Tom Head [9], of this field.

Proposition 1 (Mateescu et al. [14]). Every SH language is a regular language,
and the converse is not always true. That is, $\mathcal{SHL} \subsetneq \mathcal{REG}$ holds, where
$\mathcal{REG}$ represents the class of regular languages.
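
The splicing relation and the iterates of the closure operator that define L(G) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: strings are Python `str` values, markers are single characters, and the function names `splice`, `sigma` and `sigma_iter` are hypothetical. Since L(G) may be infinite, the sketch computes only the finite iterate after a given number of rounds.

```python
def splice(x, y, a):
    """Return every z obtained by splicing x and y at marker a,
    i.e. z = x1 + a + y2 where x = x1 + a + x2 and y = y1 + a + y2."""
    return {
        x[:i] + a + y[j + 1:]
        for i, cx in enumerate(x) if cx == a
        for j, cy in enumerate(y) if cy == a
    }


def sigma(markers, strings):
    """One application of the closure step: the input set plus all
    one-step splicings of pairs of its members at any marker."""
    result = set(strings)
    for x in strings:
        for y in strings:
            for a in markers:
                result |= splice(x, y, a)
    return result


def sigma_iter(markers, axioms, rounds):
    """The iterate after `rounds` closure steps; stops early at a fixed point."""
    current = set(axioms)
    for _ in range(rounds):
        nxt = sigma(markers, current)
        if nxt == current:
            break
        current = nxt
    return current
```

For instance, with markers `{"a"}` and axioms `{"aab", "bbabbb", "bb"}`, one round already yields `"aabbb"` (prefix `a` of `aab`, the marker, then suffix `bbb` of `bbabbb`).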



4.2 SH Languages and Erasing RFS Languages 

In the present section, we consider converting an SH system into an RFS. 

First, we prepare two special markers ( and ) that express the beginning and
the end of a string. We assume that these markers ( and ) do not belong to a
finite alphabet $\Sigma$. For a set $S \subseteq \Sigma^*$, we put $(S) = \{ (w) \mid w \in S \}$.

For a string $w$, let us denote by Sub($w$) the set of substrings of $w$, and for a
set $S$ of strings, we put Sub($S$) $= \bigcup_{w \in S}$ Sub($w$).

For sets $S \subseteq \Sigma^*$ and $M \subseteq \Sigma$, we define the set Seg$_M(S) \subseteq$ Sub($(S)$) as
follows:

\[ \mathrm{Seg}_M(S) = \{ w \in \mathrm{Sub}((S)) \mid w \in (M \cup \{(\}) \cdot (\Sigma \setminus M)^* \cdot (M \cup \{)\}) \}. \]

For example, let $\Sigma = \{a, b\}$, $M = \{a\}$ and $S = \{aab, bbabbb, bb\}$. Then
$(S) = \{(aab), (bbabbb), (bb)\}$ and Seg$_M(S) = \{(a, aa, ab), (bba, abbb), (bb)\}$.

In general, a finite subset of the set $(M \cup \{(\}) \cdot (\Sigma \setminus M)^* \cdot (M \cup \{)\})$ for
some $M \subseteq \Sigma$ is called a segment set. We can show that a segment set can
directly define an SH language and that a segment set can be converted into an
SH system (Mukouchi et al. [16]). We call a segment set Seg$_M(A)$ a segmental
representation of an SH system $G = (\Sigma, M, A)$.
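
The segment set of the example above can be recomputed with a short Python sketch. The names `segments`, `seg_set` and `generated_by_segments` are our own, and the code assumes that the characters `(` and `)` do not occur in the alphabet. The membership test assumes the reading that a string belongs to the SH language exactly when every segment of its bracketed form lies in the segment set; this is how we interpret the statement that a segment set directly defines an SH language, and it should be checked against [16].

```python
def segments(w, markers):
    """Segments of the bracketed string (w): pieces that start and end at a
    marker or bracket and contain no marker strictly in between."""
    s = "(" + w + ")"
    cuts = [i for i, c in enumerate(s) if c in markers or i in (0, len(s) - 1)]
    return {s[i:j + 1] for i, j in zip(cuts, cuts[1:])}


def seg_set(markers, strings):
    """Union of the segments of all given strings."""
    out = set()
    for w in strings:
        out |= segments(w, markers)
    return out


def generated_by_segments(w, markers, seg):
    """Assumed membership test: all segments of (w) belong to seg."""
    return segments(w, markers) <= seg
```

With `M = {"a"}` and `S = {"aab", "bbabbb", "bb"}`, `seg_set` returns the six segments `(a`, `aa`, `ab)`, `(bba`, `abbb)` and `(bb)` listed in the example.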

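By Proposition 1 and Theorem 3, we see that $\mathcal{SHL} \subseteq e\mathcal{RFSL}$ holds.

In fact, for a given SH system $G = (\Sigma, M, A)$, we can obtain an RFS $\Gamma$ such
that $L(G) = L_e(\Gamma, p_{)})$ as follows:

(1) Let $\Pi = \{ p_a \mid a \in M \cup \{)\} \}$ be the set of unary predicate symbols.

(2) Let us define an RFS $\Gamma$ as follows:

\[ \Gamma = \{ p_b(xaw) \leftarrow p_a(x) \mid a \in M,\ b \in M \cup \{)\},\ awb \in \mathrm{Seg}_M(A) \} \]
\[ \qquad \cup\ \{ p_b(w) \leftarrow\ \mid b \in M \cup \{)\},\ (wb \in \mathrm{Seg}_M(A) \}. \]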



Example 2. Let $\Sigma = \{a, b\}$, $M = \{a\}$ and $A = \{aab, bbabbb, bb\}$. Then let
$G = (\Sigma, M, A)$ be an SH system. Then Seg$_M(A) = \{(a, aa, ab), (bba, abbb), (bb)\}$.

Let $\Pi = \{p_a, p_{)}\}$ be the set of unary predicate symbols. Then let us put

\[ \Gamma = \left\{ \begin{array}{l}
p_a(\varepsilon) \leftarrow; \\
p_a(xa) \leftarrow p_a(x); \\
p_{)}(xab) \leftarrow p_a(x); \\
p_a(bb) \leftarrow; \\
p_{)}(xabbb) \leftarrow p_a(x); \\
p_{)}(bb) \leftarrow
\end{array} \right\}. \]

Then we can show that for each string $w \in \Sigma^*$, $w \in L(G)$ if and only if
$\Gamma \vdash_e p_{)}(w) \leftarrow$ by a mathematical induction on the number of markers appearing
in $w$. Thus we have $L(G) = L_e(\Gamma, p_{)})$.
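
The conversion can be mimicked in Python as a rough sketch. Everything here is our own encoding, not the paper's: a clause is a triple `(head_pred, suffix, body_pred)` standing for "p_head(x · suffix) ← p_body(x)", with `body_pred = None` for a fact "p_head(suffix) ←"; predicates are named by their marker character, with `")"` for the end marker; and `(`, `)` are assumed not to be in the alphabet.

```python
def seg_set(markers, axioms):
    """Segments of the bracketed axioms, cut at markers and brackets."""
    segs = set()
    for w in axioms:
        s = "(" + w + ")"
        cuts = [i for i, c in enumerate(s) if c in markers or i in (0, len(s) - 1)]
        segs.update(s[i:j + 1] for i, j in zip(cuts, cuts[1:]))
    return segs


def to_rfs(markers, axioms):
    """Build one clause per segment, following steps (1) and (2)."""
    clauses = set()
    for seg in seg_set(markers, axioms):
        first, w, last = seg[0], seg[1:-1], seg[-1]
        if first == "(":
            clauses.add((last, w, None))            # p_last(w) <-
        else:
            clauses.add((last, first + w, first))   # p_last(x first w) <- p_first(x)
    return clauses


def proves(clauses, pred, w):
    """Decide whether the ground atom p_pred(w) is derivable.
    Terminates because every non-fact clause strips a non-empty suffix."""
    for head, suffix, body in clauses:
        if head != pred:
            continue
        if body is None:
            if w == suffix:
                return True
        elif w.endswith(suffix) and proves(clauses, body, w[:len(w) - len(suffix)]):
            return True
    return False
```

On the data of Example 2 this produces six clauses, and `proves(clauses, ")", "aabbb")` succeeds via the clauses for the segments `(a`, `aa` and `abbb)`.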

4.3 Learnability of SH Languages

Since every finite language $L \subseteq \Sigma^*$ is generated by an SH system $G = (\Sigma, \emptyset, L)$,
we see that the class $\mathcal{SHL}$ is super-finite, and thus we have the following theorem:

Theorem 9. The class $\mathcal{SHL}$ is not inferable in the limit from positive examples.

On the other hand, the following example shows that the class of infinite SH 
languages is also not inferable in the limit from positive examples. 

Example 3. Let $\#\Sigma \ge 2$ and let $a, b \in \Sigma$ be two distinct constant symbols.

First, for each $n \ge 0$, let $G_n = (\Sigma, \{a\}, A_n)$ be an SH system with

\[ A_n = \{ aababba \cdots ab^n a \}. \]

Then, as easily seen,

\[ L(G_n) = \{ a b^{i_1} a b^{i_2} a \cdots a b^{i_m} a \mid m \ge 1,\ 0 \le i_1, i_2, \ldots, i_m \le n \} \]

holds. On the other hand, let $G_\infty = (\Sigma, \{a, b\}, A_\infty)$ be an SH system with

\[ A_\infty = \{ aabba \}. \]

Then

\[ L(G_\infty) = \{ a b^{i_1} a b^{i_2} a \cdots a b^{i_m} a \mid m \ge 1,\ i_1, i_2, \ldots, i_m \ge 0 \} \]

holds. Thus it is easy to see that these languages satisfy the conditions of Lemma
1. Therefore the class of infinite SH languages is not inferable in the limit from
positive examples.

By this example, we see that even if the number of axioms is restricted to a
fixed number, $\mathcal{SHL}$ is also not inferable in the limit from positive examples.

Now, we introduce a subclass of $\mathcal{SHL}$ that is inferable in the limit from
positive examples.

For each k > 0, we define the subclass $\mathcal{SHS}^{\le k}$ of $\mathcal{SHS}$ as the class of SH
systems $G = (\Sigma, M, A)$ with $\#\mathrm{Seg}_M(A) \le k$. Then we denote by $\mathcal{SHL}^{\le k}$ the
class of SH languages generated by SH systems in $\mathcal{SHS}^{\le k}$.

By the conversion method shown in the previous section, a given SH
system $G \in \mathcal{SHS}^{\le k}$ is converted into a canonical RFS $\Gamma \in \mathcal{CRFS}^{\le k}$ such
that $L(G) = L_e(\Gamma, p)$ for some unary predicate symbol $p$. This means that
$\mathcal{SHL}^{\le k} \subseteq e\mathcal{CRFSL}^{\le k}$.

Since $e\mathcal{CRFSL}^{\le k}$ has finite elasticity, we have the following theorem:

Theorem 10. Let k > 0.

Then the class $\mathcal{SHL}^{\le k}$ has finite elasticity, and thus it is inferable in the
limit from positive examples.

We note that we can also show the above theorem by using the framework 
of the so-called monotonic formal system due to Shinohara [22] . 

5 Concluding Remarks

We have discussed the influence of allowing erasing substitutions upon the inferability
of classes of EFS languages. In particular, we presented a subclass of erasing EFS
languages that is inferable in the limit from positive examples.

Then we showed that each SH system can be converted into a canonical RFS
whose erasing EFS language coincides with the SH language of the original SH
system. By using this conversion, we showed that the subclass of SH languages
generated by SH systems with at most k segments is inferable in the limit from
positive examples for each k > 0.

As easily seen, even if erasing substitutions are not allowed, each SH system
can be converted into an RFS, but the axioms take a more complicated form
and more axioms are needed. There seem to be many cases in which
allowing erasing substitutions is more suitable.

Abstract. A revision algorithm is a learning algorithm that identifies
the target concept, starting from an initial concept. Such an algorithm 
is considered efficient if its complexity (in terms of the resource one is 
interested in) is polynomial in the syntactic distance between the initial 
and the target concept, but only polylogarithmic in the number of vari- 
ables in the universe. We give efficient revision algorithms in the model 
of learning with equivalence and membership queries. The algorithms 
work in a general revision model where both deletion and addition type 
revision operators are allowed. In this model one of the main open prob- 
lems is the efficient revision of Horn sentences. Two revision algorithms 
are presented for special cases of this problem: for depth-1 acyclic Horn 
sentences, and for definite Horn sentences with unique heads. We also 
present an efficient revision algorithm for threshold functions. 



1 Introduction 

Efficient learnability has been studied from many different angles in computa-
tional learning theory for the last two decades, for example, in both the PAC
and query learning models, and by measuring complexity in terms of sample size,
the number of queries or running time. Attribute-efficient learning algorithms 
are required to be efficient (polynomial) in the number of relevant variables, and 
“super-efficient” (polylogarithmic) in the total number of variables [1,2]. It is 
argued that practical and biologically plausible learning algorithms need to be 
attribute efficient. 

A related notion, efficient revision algorithms, originated in machine learning 
[3, 4, 5, 6], and has received some attention in computational learning theory as 
well. A revision algorithm is applied in a situation where learning does not start 
from scratch, but there is an initial concept available, which is a reasonable
approximation of the target concept. The standard example is an initial version 
of an expert system provided by a domain expert. The efficiency criterion in this 
case is to be efficient (polynomial) in the distance from the initial concept to the 
target (whatever distance means; we get back to this in a minute), and to be 
“super-efficient” (polylogarithmic) in the total size of the initial formula. Again, 
it is argued that this is a realistic requirement, as many complex concepts can 
only be hoped to be learned efficiently if a reasonably good initial approximation 
is available. The notion of distance usually considered is a syntactic one: the 
number of edit operations that need to be applied to the initial representation 
in order to get a representation of the target. The particular edit operations 
considered depend on the concept class. Intuitively, attribute-efficient learning 
is a special case of efficient revision, when the initial concept has an empty 
representation. In machine learning, the study of revision algorithms is referred 
to as theory revision; detailed references to the literature are given in Wrobel’s 
overviews of theory revision [7,8] and also in our recent papers [9,10]. 

The theoretical study of revision algorithms was initiated by Mooney [11]
in the PAC framework. We have studied revision algorithms in the model of
learning with equivalence and membership queries [9,10] and in the mistake-
bound model [12].

It is a general observation both in practice and in theory that those edit 
operations which delete something from the initial representation are easier to 
handle than those which add something to it. We have obtained efficient revision 
algorithms for monotone DNF with a bounded number of terms when both 
deletion and addition type revisions are allowed, but for the practically important 
case of Horn sentences we found an efficient revision algorithm only for the 
deletions-only model. We also showed that efficient revision of general (or even 
monotone) DNF is not possible, even in the deletions-only model. Finding an 
efficient revision algorithm for Horn sentences with a bounded number of terms 
in the general revision model (deletions and additions) emerged as perhaps the 
main open problem posed by our previous work on revision algorithms. The 
work presented here extends that of Doshi [13], who gave a revision algorithm 
for “unique explanations”, namely depth-1, acyclic Horn formulas with unique, 
non-F, and unrevisable heads. Horn revision also happens to be the problem 
that most practical theory revision systems address. It is to be noted here that 
the notions of learning and revising Horn formulas are open to interpretation, 
as discussed in [10]; the kind of learnability result that we wish to extend to 
revision in this paper is Angluin et al.’s for propositional Horn sentences [14]. 

It seems to be an interesting general question whether attribute-efficiently 
learnable classes can also be revised efficiently. The monotone DNF result men- 
tioned above shows that the answer is negative in general. We gave a positive 
answer for parity functions in [9] and for projective DNF in [12]. Projective DNF 
is a class of DNF introduced by Valiant [15], as a special case of his projective 
learning model, and as part of a framework to formulate expressive and biologi- 
cally plausible learning models. In biological terms revision may be relevant for 
learning when some information is hard-wired from birth; see, e.g., Pinker [16]
for recent arguments in favor of hereditary information in the brain. 

Valiant showed that projective DNF are attribute-efficiently learnable in the 
mistake-bound model, and we extended his result by showing that they are effi- 
ciently revisable. Our algorithm was based on showing that a natural extension 
of the Winnow algorithm is in fact an efficient revision algorithm for disjunctions 
even in the presence of attribute errors. 

Valiant’s related models [17,18] also involve threshold functions, and as 
threshold functions are also known to be attribute-efficiently learnable, this raises 
the question whether threshold functions can be revised efficiently. Threshold 
functions (also called Boolean threshold functions or zero-one threshold func- 
tions in the literature) form a much studied concept class in computational 
learning theory. Winnow is an attribute-efficient mistake-bounded learning al- 
gorithm [19]. Attribute-efficient proper query learning algorithms are given in 
Uehara et al. [20] and Hegedűs and Indyk [21]. Further related results are given
in [22,23,24,25]. 

Results. In this paper we present results for the two revision problems outlined 
above: the revision of Horn sentences and threshold functions, in the general 
revision model allowing both deletions and additions (more precise definitions 
are given in Section 2). We use the model of learning with membership and 
equivalence queries. 

For Horn sentences, we show that one can revise two subclasses of Horn 
sentences with respect to both additions and deletions of variables. The new 
algorithms make use of our previous, deletions-only revision algorithm for Horn 
sentences [10] and new techniques which could be useful for the general question 
as well. 

The first of the two classes is depth-1 acyclic Horn sentences. The class of 
acyclic Horn sentences was introduced by Angluin [26] in a paper that presented 
the first efficient learning algorithm for a class of Horn sentences, and was studied 
from the point of view of minimization and other computational aspects (see, 
e.g., [27]), and in the context of predicate logic learning [28]. We consider the 
subclass of depth-1 acyclic Horn sentences, where, in addition to assuming that 
the graph associated with the Horn sentence is acyclic, we also require this graph 
to have depth 1. One of our main results, Theorem 1, shows that this class can be
revised using 0{dist{(p,ip)- ■ log n) queries, where n is the number of variables, 

$\varphi$ is the $m$-clause initial formula, $\psi$ is the target formula, and dist is the revision
distance, which will be defined formally in Section 2. 

We also give a revision algorithm for definite Horn sentences with unique 
heads, meaning that no variable ever occurs as the head of more than one Horn 
clause. For this class, we revise with query complexity 0{rn^ + dist{ip, ip) ■ {m^ + 
logn)), where again $\varphi$ is the initial formula and $\psi$ is the target function
(Theorem 2).

For threshold functions we give a revision algorithm using $O(\mathrm{dist}(\varphi, \psi) \cdot \log n)$
queries (Theorem 3). In this algorithm the pattern mentioned above is reversed, 
and it turns out to be easier to handle additions than deletions. It is also shown 
that both query types are necessary for efficient revision, and that the query
complexity of the algorithm is essentially optimal up to order of magnitude. 
Another interesting point is that the natural extension of Winnow mentioned 
above does not work in this more general context. 

Organization of paper. Preliminaries are given in Section 2, Horn formula revi- 
sions in Section 3, threshold functions in Section 4. Due to space constraints, 
complete proofs are deferred to the full version of the paper. 

2 Preliminaries 

We use standard notions from propositional logic such as variable, literal, term 
(or conjunction), clause (or disjunction), etc. The set of variables for $n$-variable
formulas and functions is $X_n = \{x_1, \ldots, x_n\}$. (In this paper, $n$ will always be the
total number of variables.) Instances or vectors are elements $x \in \{0, 1\}^n$. When
convenient we treat $x$ as a subset of $[n]$ or $X_n$, corresponding to the components,
resp. the variables, which are set to true in $x$. Given a set $Y \subseteq [n] = \{1, \ldots, n\}$,
we write $\chi_Y = (\alpha_1, \ldots, \alpha_n) \in \{0, 1\}^n$ for the characteristic vector of $Y$. We write
$x = (x_1, \ldots, x_n) \le y = (y_1, \ldots, y_n)$ if $x_i \le y_i$ for every $i = 1, \ldots, n$.

A Horn clause is a disjunction with at most one unnegated variable; we will 
usually think of it as an implication and call the clause’s unnegated variable its 
head, and its negated variables its body. We write body(c) and head(c) for the 
body and head of c, respectively. A clause with an unnegated variable is called 
definite (or positive). We will consider clauses with no unnegated variables to 
have head F, and will sometimes write them as (body → F). A Horn sentence
is a conjunction of Horn clauses. A Horn sentence is definite if all its clauses 
are definite. A Horn sentence has unique heads if no two clauses have the same 
head. 

We define the graph of a Horn sentence to be a directed graph on the variables 
together with T and F, with an edge from variable u to variable v (resp. F) iff
there is a clause with head v (resp. F) having u in its body, and an edge from T 
to variable v if there is a clause consisting solely of variable v. A Horn sentence 
is acyclic if its graph is acyclic; the depth of an acyclic Horn sentence is the 
maximum path length in its graph [26]. 

For example, the Horn sentence

\[ (x_1 \wedge x_2 \rightarrow x_3) \wedge (x_1 \wedge x_4 \rightarrow x_5) \wedge (x_4 \wedge x_6 \rightarrow F) \]

is depth-1 acyclic. Its graph has the edges $(x_1, x_3)$, $(x_2, x_3)$, $(x_1, x_5)$, $(x_4, x_5)$,
$(x_4, F)$, and $(x_6, F)$ and this graph is acyclic with depth 1.
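
The graph of a Horn sentence and its depth can be computed mechanically; the following Python sketch does so under our own encoding, which is not from the paper: a clause is a pair `(body, head)` with `body` a set of variable names and `head` a variable name or the string `"F"`, and `"T"` is the source node for clauses with an empty body. The function names are hypothetical.

```python
def horn_graph(clauses):
    """Edges (u, head) for every body variable u, plus (T, head) for
    clauses that consist solely of their head."""
    edges = set()
    for body, head in clauses:
        if not body:
            edges.add(("T", head))
        for u in body:
            edges.add((u, head))
    return edges


def depth(edges):
    """Maximum path length in the graph, or None if the graph has a cycle."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    done = {}        # node -> length of the longest path starting there
    on_path = set()  # nodes on the current depth-first search path

    def longest_from(u):
        if u in on_path:
            raise ValueError("cycle")
        if u not in done:
            on_path.add(u)
            done[u] = max((1 + longest_from(v) for v in succ.get(u, ())), default=0)
            on_path.discard(u)
        return done[u]

    try:
        return max((longest_from(u) for u in succ), default=0)
    except ValueError:
        return None
```

Applied to the three clauses of the example, `horn_graph` returns the six listed edges and `depth` returns 1.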

If $x$ satisfies the body of Horn clause $c$, considered as a term, we say $x$
covers $c$. Notice that $x$ falsifies $c$ if and only if $x$ covers $c$ and head($c$) $\notin x$. (By
definition, F $\notin x$.)

For Horn clause body $b$ (or any monotone term) and vector $x$, we use $b \cap x$
for the monotone term that has those variables of $b$ that correspond to 1's in $x$.
As an example, $x_1 x_4 \cap 1100 = x_1$.

An $n$-variable threshold function $\mathrm{TH}_U^t$ is specified by a set $U \subseteq [n]$ and a
threshold $0 \le t \le n$, such that for a vector $x = (x_1, \ldots, x_n) \in \{0, 1\}^n$ it holds
that $\mathrm{TH}_U^t(x) = 1$ iff at least $t$ of the variables with subscripts in $U$ are set to 1 in
$x$. In other words, $\mathrm{TH}_U^t(x) = 1$ iff $\sum_{i=1}^{n} \alpha_i x_i \ge t$, where $\chi_U = (\alpha_1, \ldots, \alpha_n)$. We
say that $S$ is a positive (resp., negative) set if $\chi_S$ is a positive (resp., negative)
example of the target threshold function. (As the number of variables is clear
from the context, we do not mention it in the notation.) Note that for every
non-constant threshold function its set of relevant variables and its threshold
are well defined, thus every non-constant function has a unique representation.
The variables with indices in $U$ (resp., outside of $U$) are the relevant (resp.,
irrelevant) variables of $\mathrm{TH}_U^t$. As noted in the introduction, functions of this
type are also called Boolean threshold functions and 0-1-threshold functions,
in order to distinguish them from the more general kind of threshold functions
where the coefficients $\alpha_i$ can be arbitrary reals. We simply call them threshold
functions, as we only consider this restricted class.
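
As a small illustration of the definition, a threshold function can be represented in Python by the pair (U, t). The sketch below is our own and assumes that an instance is given as the set of indices whose components are 1; the name `threshold` is hypothetical.

```python
def threshold(U, t):
    """Return the 0-1 threshold function with relevant index set U and
    threshold t. An instance is passed as the set of indices set to 1."""
    U = frozenset(U)

    def th(x):
        # Count the relevant variables that are switched on in x.
        return int(len(U & frozenset(x)) >= t)

    return th
```

For example, `threshold({1, 2, 4}, 2)` accepts `{1, 4, 5}` (two relevant variables are on) and rejects `{1, 3}` (only one is on).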

We use the standard model of membership and equivalence queries (with 
counterexamples), denoted by MQ and EQ [29]. In an equivalence query, the 
learning algorithm proposes a hypothesis, a concept h, and the answer depends 
on whether h = c, where c is the target concept. If so, the answer is “correct”, 
and the learning algorithm has succeeded in its goal of exact identification of 
the target concept. Otherwise, the answer is a counterexample, any instance $x$
such that $c(x) \ne h(x)$.



2.1 Revision 

The revision distance between a formula $\varphi$ and a concept $C$ is defined to be the
minimum number of applications of a specified set of syntactic revision operators
to $\varphi$ needed to obtain a formula for $C$. The revision operators may depend on
the concept class one is interested in. Usually, a revision operator can either be 
deletion-type or addition-type. 

For disjunctive or conjunctive normal forms, the deletion operation can be 
formulated as fixing an occurrence of a variable in the formula to a constant. In 
the general model, studied in this paper, we also allow additions. The addition 
operation is to add a new literal to one of the terms or clauses of the formula. 
(Adding a new literal to open up a new clause or term would be an even more 
general addition-type operator, which we have not considered so far.) In the case 
of Horn sentences the new literals must be added to the body of a clause. 

In the case of threshold functions, deletions mean deleting a relevant variable 
and additions mean adding a new relevant variable. In the general model for this 
class we also allow the modification of the threshold. We consider the modification 
of the threshold by any amount to be a single operation (as opposed to changing 
it by one); as we are going to prove upper bounds, this only makes the results 
stronger. Thus, for example, the revision distance between $\varphi = \mathrm{TH}^{t}_{\{x_1, x_2, x_4\}}$ and
$\mathrm{TH}^{t'}_{\{x_1, x_2, x_3, x_5\}}$ with $t \ne t'$ is 4 in the general model.

We use $\mathrm{dist}(\varphi, \psi)$ to denote the revision distance from $\varphi$ to $\psi$ whenever
the revision operators are clear from context. In general, the distance is not
symmetric.
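
For threshold functions in the general model, the revision distance can be counted directly from the two representations. The Python sketch below is our own illustration; it assumes both functions are non-constant (so that their representations are unique) and uses the operators described above: one step per deleted variable, one per added variable, and a single step for any change of the threshold. The function name and the sample values are hypothetical.

```python
def threshold_revision_distance(U, t, V, s):
    """Number of revision steps from (U, t) to (V, s): delete each variable in
    U but not in V, add each variable in V but not in U, and count one more
    step if the thresholds differ."""
    U, V = frozenset(U), frozenset(V)
    return len(U - V) + len(V - U) + int(t != s)
```

With the illustrative values `U = {1, 2, 4}`, `t = 1`, `V = {1, 2, 3, 5}`, `s = 3`, the count is one deletion, two additions and one threshold change. For this particular class and operator set the count happens to be the same in both directions.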

A revision algorithm for a formula $\varphi$ has access to membership and equiva-
lence oracles for an unknown target concept and must return some representa-
tion of the target concept. Our goal is to find revision algorithms whose query
complexity is polynomial in $d = \mathrm{dist}(\varphi, \psi)$, but at most polylogarithmic in $n$,
the number of variables in the universe. Dependence on other parameters may
depend on the concept class. For DNF (resp. CNF) formulas, we will allow poly-
nomial dependence on the number of terms (resp. clauses) in $\varphi$; it is impossible
to do better even for arbitrary monotone DNF in the deletions-only model of
revision [9].

We state only query bounds in this paper; all our revision algorithms are 
computable in polynomial time, given the appropriate oracles. 

2.2 Binary Search for New Variables 

Many of our revision algorithms use a kind of binary search, often used in learning 
algorithms involving membership queries, presented as Algorithm 1. The starting 
points of the binary search are two instances, a negative instance neg and a 
positive instance pos such that neg < pos. The algorithm returns two items: 
the first is a set of variables that when added to neg make a positive instance; 
the second is a variable that is critical in the sense that the first component plus 
neg becomes a negative instance if that variable is turned off. 



Algorithm 1 BinarySearch(neg, pos).

Require: MQ(neg) == 0 and MQ(pos) == 1 and neg < pos
1: neg_0 := neg
2: while neg and pos differ in more than 1 position do
3:     Partition pos \ neg into approximately equal-size sets d_1 and d_2.
4:     Put mid := neg with positions in d_1 switched to 1
5:     if MQ(mid) == 0 then
6:         neg := mid
7:     else
8:         pos := mid
9: v := the one variable on which pos and neg differ
10: return ((pos \ neg_0), v)
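
A runnable Python rendering of this binary search may help to see the invariant at work. It is a sketch under our own conventions: instances are sets of variable indices, `mq` is any function returning 0 or 1 that plays the role of the membership oracle, and the name `binary_search` is ours.

```python
def binary_search(mq, neg, pos):
    """Given a negative instance neg strictly below a positive instance pos,
    return (added, v): added is a set of variables that turns the original
    neg into a positive instance, and v is critical in that removing it
    from the result gives a negative instance again."""
    neg, pos = frozenset(neg), frozenset(pos)
    assert neg < pos and mq(neg) == 0 and mq(pos) == 1
    neg0 = neg
    # Invariant: mq(neg) == 0, mq(pos) == 1 and neg is a proper subset of pos.
    while len(pos - neg) > 1:
        diff = sorted(pos - neg)
        d1 = frozenset(diff[: len(diff) // 2])
        mid = neg | d1                 # neg with the positions in d1 switched on
        if mq(mid) == 0:
            neg = mid
        else:
            pos = mid
    (v,) = pos - neg                   # the single position where they differ
    return pos - neg0, v
```

Each iteration roughly halves the number of positions on which the two instances differ, so the number of membership queries is logarithmic in that number.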



3 Revising Horns 

In this section we give algorithms for revising two different classes of Horn sen- 
tences when addition of new variables into the bodies is allowed as well as deletion 
of variables. 

3.1 Depth-1 Acyclic Horn Sentences

We show here how to revise depth-1 acyclic Horn sentences. Depth-1 acyclic Horn 
sentences are precisely those where variables that occur as heads never occur in 
the body of any clause. Notice that such formulas are a class of unate CNF. 
Previously we gave a revision algorithm for unate DNF (which would dualize 
to unate CNF) that was exponential in the number of clauses [9]. Here we give 
an algorithm for an important subclass of unate CNF that is polynomial in the 
number of clauses. 

The general idea of the algorithm is to maintain a one-sided hypothesis, in 
the sense that all equivalence queries using the hypothesis must return negative 
counterexamples until the hypothesis is correct. 

Each negative counterexample can be associated with one particular head 
of the target clause, or else with a headless target clause. We do this with a 
negative counterexample x as follows. 

For a head variable $v$ and instance $x$, we will use the notation $x^v$ to refer to
$x$ modified by setting all head variables other than $v$ to 1. Note that $x^v$ cannot
falsify any clause with a head other than $v$. Since $v$ will normally be the head of
a Horn clause and we use F to denote the “head” of a headless Horn clause, we
will use $x^F$ to denote $x$ modified to set all head variables to 1.

The algorithm begins with an assumption that the revision distance from the 
initial theory to the target theory is e. If the revision fails, then e is doubled 
and the algorithm is repeated. Since the algorithm is later shown to be linear 
in e, this series of attempts does not affect the asymptotic complexity. We give 
a brief overview of the algorithm, followed by somewhat more detail, and the 
pseudocode, as Algorithm 2. 
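
The doubling strategy itself is independent of the details of the revision routine and can be sketched as a small Python wrapper. The interface is an assumption of ours: `revise_up_to_e(e)` is a hypothetical callable that returns a hypothesis on success and `None` on failure. The wrapper does not terminate if the routine never succeeds.

```python
def revise_with_doubling(revise_up_to_e, e=1):
    """Try budgets e, 2e, 4e, ... until the bounded revision routine succeeds.
    Returns the result together with the list of budgets that were tried."""
    budgets = []
    while True:
        budgets.append(e)
        result = revise_up_to_e(e)
        if result is not None:
            return result, budgets
        e *= 2
```

If the routine succeeds as soon as the budget reaches the true distance d, the last budget tried is below 2d and the budgets sum to less than 4d, which is why a cost linear in e keeps the same asymptotic bound.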

We maintain a hypothesis that is, viewed as the set of its satisfying vectors, 
always a superset of the target. Thus each time we ask an equivalence query, 
if we have not found the target, we get a negative counterexample x. Then the 
first step is to ask a membership query on x modified to turn on all of the 
head positions. If that returns 0, then the modified x must falsify a headless 
target clause. Otherwise, for each head position $h$ that is 0 in the original $x$,
ask a membership query on $x^h$. We stop when the first such membership query
returns 0; we know that $x$ falsifies a clause with head $h$. In our pseudocode, we
refer to the algorithm just described as Associate.
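
The association step can be sketched in Python as follows. The encoding is ours, not the paper's: instances are sets of variable indices, `heads` is the set of head variables, `mq` is a 0/1 membership oracle, and the helper `make_mq` builds such an oracle from a list of `(body, head)` clauses purely for demonstration.

```python
def make_mq(clauses):
    """Membership oracle for a Horn sentence given as (body, head) pairs;
    the head "F" is never contained in an instance."""
    def mq(s):
        return int(all(not body <= s or head in s for body, head in clauses))
    return mq


def associate(mq, x, heads):
    """Return "F" if x with all heads switched on is still negative;
    otherwise the first head h missing from x such that x with all other
    heads switched on is negative. Returns None if no head qualifies."""
    x, heads = frozenset(x), frozenset(heads)
    if mq(x | heads) == 0:
        return "F"
    for h in sorted(heads - x):
        if mq(x | (heads - {h})) == 0:
            return h
    return None
```

For the depth-1 sentence with clauses {1, 2} → 3, {1, 4} → 5 and {4, 6} → F, the negative instance `{1, 4}` is associated with head 5 and the negative instance `{4, 6}` with F.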

Once a negative counterexample is associated with a head, we first try to use 
it to make deletions from an existing hypothesis clause with the same head. If 
this is not possible, then we use the counterexample to add a new clause to the 
hypothesis. We find any necessary additions when we add a new clause. 

If $x^h \cap \mathrm{body}(c)^h$ is a negative instance, which we can determine by a mem-
bership query, then we can create a new smaller hypothesis clause whose body
is $x \cap \mathrm{body}(c)$. (Notice that $x \cap \mathrm{body}(c) \subset \mathrm{body}(c)$ because as a negative coun-
terexample, $x$ must satisfy $c$.)

To use X to add a new clause, we then use an idea from the revision algorithm 
for monotone DNF [9] . For each initial theory clause with the same head as we 
have associated (which for F is all initial theory clauses, since deletions of heads 
are allowed), use binary search from x intersect (the initial clause with the other
heads set to 1) up to x. If we get to something negative with fewer than e 
additions, we update x to this negative example. 

Whether or not x is updated, we keep going, trying all initial theory clauses 
with the associated head. This guarantees that in particular we try the initial 
theory clause with smallest revision distance to the target clause that x falsifies. 
All necessary additions to this clause are found by the calls to BinarySearch; 
later only deletions will be needed. 



Algorithm 2 HornReviseUpToE(φ, e). Revises depth-1 acyclic Horn sentence
φ if possible using ≤ e revisions; otherwise returns failure.

1: H := everywhere-true empty conjunction
2: while (x := EQ(H)) ≠ “Correct!” and e > 0 do
3:     h := Associate(x, φ)
4:     for all clauses c ∈ H with head h do
5:         if MQ(x^h ∩ body(c)^h) == 0 then {delete vars from c}
6:             body(c) := body(c) ∩ x
7:             e := e − number of variables removed
8:     if no vars. were deleted from any clause then {find new clause to add}
9:         min := e
10:        FoundAClause := false
11:        for all c ∈ φ with head h (or all c ∈ φ if h == F) do
12:            new := body(c)^h ∩ x^h
13:            numAddedLits := 0 {# additions for this c}
14:            while MQ(new) == 1 and numAddedLits < e do
15:                l := BinarySearch(new, x^h)
16:                new := new ∪ {l}
17:                numAddedLits := numAddedLits + 1
18:                if MQ(x − {l}) == 0 then {(x − {l}) is “Pivot”}
19:                    {i.e., x − {l} is counterexample falsifying fewer target clauses}
20:                    x := x − {l}
21:                    restart the for all c loop with this x by backing up to Line 11 to reset other parameters
22:            if MQ(new) == 0 then
23:                x := new
24:                FoundAClause := true
25:                min := min(numAddedLits, min)
26:        if not FoundAClause then
27:            return “Failure”
28:        else
29:            H := H ∧ (x → h) {treating x as monotone disjunction}
30:            e := e − min
31: if x == “Correct!” then
32:     return H
33: return “Failure”








Theorem 1. There is a revision algorithm for depth-1 acyclic Horn sentences 
with query complexity 0{d ■ • logn), where d is the revision distance and m 

is the number of clauses in the initial formula. 

Proof sketch: We give here some of the highlights of the proof of correctness 
and query complexity of the algorithm; space does not permit detailed proofs. 
A relatively straightforward calculation shows that the query complexity of
HornReviseUpToE for revising an initial formula of $m$ clauses on a universe of $n$
variables is polynomial in $m$, $\log n$, and the parameter $e$. Thus, if we can argue
that when e is at least the revision distance the algorithm succeeds in finding 
the target, we are done. 

The result follows from a series of lemmas. The first two lemmas give quali- 
tative information. The first shows that the hypothesis is always one-sided (i.e., 
only negative counterexamples can ever be received), and the second says that 
newly added hypothesis clauses are not redundant. 

Lemma 1. The algorithm maintains the invariant that its hypothesis is true for 
every instance that satisfies the target function. 

Proof. Formally the proof is by induction on number of changes to the hypoth- 
esis. The base case is true, because the initial hypothesis is everywhere true. 

For the inductive step, consider how we update the hypothesis, either by 
adding a new clause or deleting variables from the body of an existing clause. 

Before creating or updating a clause to have head $h$ and body $y$, we have
ensured that $\mathrm{MQ}(y^h) = 0$, that is, that $y^h$ is a negative instance. Because of
that, $y^h$ must falsify some clause, and because of its form and the syntactic form
of the target, it must be a clause with head $h$. None of the heads in $y^h \setminus y$ can
be in any body, so $y$ must indeed be a superset of the variables of some target
clause with head $h$, as claimed.

Lemma 2. If negative counterexample x is used to add a new clause with head 
h to the hypothesis, then the body of the new clause does not cover any target 
clause body covered by any other hypothesis clause with head h. 

Proof. Recall that head h was associated with x. If x falsified the same target 
clause as an existing hypothesis clause body, then x would be used to delete 
variables from that hypothesis clause body. Therefore x does not falsify the 
same target clause as any existing hypothesis clause with the same head, and 
the newly added hypothesis clause’s body is a subset of x. 

The following two lemmas, whose proofs will be given in the journal version
of this paper, complete the proof.

Lemma 3. HornReviseUpToE($\varphi$, $e$) succeeds in finding the target Horn
sentence $\psi$ if it has revision distance at most $e$ from the initial formula $\varphi$.

Lemma 4. The query complexity of HornReviseUpToE is 0{mf ■ e ■ logn), 
where the initial formula has m clauses and there are n variables in the universe. 
3.2 Definite Horn Sentences with Unique Heads

We give here a revision algorithm for definite Horn sentences with unique heads. 
Since we are considering only definite Horn sentences, note that the head vari- 
ables cannot be fixed to 0. We use the algorithm for revising Horn sentences in 
the deletions-only model presented in [10] as a subroutine. Its query complexity
is $O(dm^3 + m^4)$, where $d$ is the revision distance and $m$ is the number of clauses
in the initial formula.

This algorithm has a first phase that finds all the variables that need to be 
added to the initial formula. That partially revised formula is then passed as an 
initial formula to the known algorithm [10] for revising Horn sentences in the 
deletions-only model of revision. 

To find all necessary additions to the body $b$ of clause $c = (b \to h)$, we first
construct the instance $x_c$ as follows: in the positions of the heads of the initial
formula, instance $x_c$ has 1's, except for a 0 in position $h$. Furthermore, instance
$x_c$ has 1's in all the positions corresponding to a variable in the clause's body,
and 0 in all positions not yet specified.

Next, the query $\mathrm{MQ}(x_c)$ is asked. If $\mathrm{MQ}(x_c) = 0$, then no variables need to
be added to the body of $c$. If $\mathrm{MQ}(x_c) = 1$, the necessary additions to the body
of $c$ are found by repeated use of BinarySearch. To begin the binary search,
$x_c$ is the known positive instance that must satisfy the target clause $c^*$ derived
from $c$, and the assignment with a 0 in position $h$ and a 1 everywhere else is the
known negative instance that must falsify $c^*$.

Each variable returned by BinarySearch is added to the body of the clause,
and $x_c$ is updated by setting the corresponding position to 1. The process ends
when $x_c$ becomes a negative instance.

Once the necessary additions to every clause in the initial theory are found, a 
Horn sentence needing only deletions has been produced, and the deletions-only 
algorithm from [10] can be used to complete the revisions. 

Theorem 2. There is a revision algorithm for definite Horn sentences with
unique heads in the general model of revision with query complexity
$O(m^4 + dm^3 + d \log n)$, where $d$ is the revision distance from the initial formula
to the target formula.

Proof sketch. The key part is adding variables to one initial clause $c = (b \to h)$.

Let c* be the target clause derived from c. 

Lemma 5. Every variable added to a clause c must occur in the target clause 
c* that is derived from c. 

(Proof of lemma omitted.) 

This is enough to allow us to use the earlier algorithm for learning in the 
deletions-only model. 

The query complexity for the part of the algorithm that finds necessary
additions is at most $O(\log n)$ per added variable, which contributes a term of
$O(d \log n)$. The deletions-only algorithm has complexity $O(m^4 + dm^3)$, where
$d$ is the revision distance. Combining these two gives us $O(m^4 + dm^3 + d \log n)$.

□ 
4 Revising Threshold Functions

We present a threshold revision algorithm ReviseThreshold. The overall
revision algorithm is given as Algorithm 3, using the procedures described in
Algorithms 4 and 5. Algorithm ReviseThreshold has three main stages. First
we identify all the variables that are irrelevant in $\varphi$ but relevant in $\psi$
(Algorithm FindAdditions). Then we identify all the variables that are relevant
in $\varphi$ but irrelevant in $\psi$ (Algorithm FindDeletions). Finally, we determine
the target threshold. (In our pseudocode this third step is built into Algorithm
FindDeletions, as the last iteration after the set of relevant variables of the
target function is identified.)



Algorithm 3 The procedure ReviseThreshold($\varphi$)

1: {function to be revised is $\varphi = \mathrm{TH}_U^t$}
2: Use 2 EQ's to determine if target is constant 0 or 1; if so return
3: $V$ := FindAdditions($U$)
4: return FindDeletions($U$, $V$)



The main result of the section is the following. 

Theorem 3. ReviseThreshold is a threshold function revision algorithm of
query complexity $O(d \log n)$, where $d$ is the revision distance between the initial
formula $\varphi$ and the target function $\psi$.

Proof sketch. Throughout, let $\varphi = \mathrm{TH}_U^t$ and $\psi = \mathrm{TH}_R^\theta$.



Algorithm 4 The procedure FindAdditions($U$)
Require: the target function is not constant
 1: Potentials := $X_n \setminus U$; NewRelevants := $\emptyset$
 2: if MQ($x_U$) == 0 then
 3:   Base := $U$
 4: else
 5:   (Base, $x$) := BinarySearch($\emptyset$, $U$), Base := Base $\setminus \{x\}$
 6: if MQ($x_{Base \cup Potentials}$) == 0 then
 7:   return $\emptyset$
 8: repeat
 9:   ($Y$, $y$) := BinarySearch(Base, Base $\cup$ Potentials)
10:   NewRelevants := NewRelevants $\cup\ \{y\}$
11:   Potentials := Potentials $\setminus \{y\}$
12:   if MQ($x_{Base \cup Potentials}$) == 0 then
13:     Base := Base $\cup\ \{y\}$
14: until MQ($x_{Base}$) == 1
15: return NewRelevants
A set $S$ is critical for the target function $\psi$, or simply critical, if
$|S \cap R| = \theta - 1$. It is clear from the definition that if $S$ is critical, then for
every $Z \subseteq X_n \setminus S$ it holds that $Z$ contains at least one relevant variable of $\psi$
iff $\mathrm{MQ}(x_{S \cup Z}) = 1$.



Algorithm 5 The procedure FindDeletions($U$, $V$)

Require: $U$, $V$ disjoint; $V \subseteq R$ and $R \subseteq U \cup V$ ($R$ = relevant variables in target)
 1: U ~ U\
 2: $u$ := $|U| + |V|$; $\ell$ := 1
 3: if ($\psi'$ := TestExtreme($N$, $P$, $U \cup V$)) constant then
 4:   return $\psi'$
 5: while $u > \ell + 1$ do
 6:   $m$ := $\lceil (u + \ell)/2 \rceil$
 7:   if ($x$ := EQ($\mathrm{TH}_{U \cup V}^m$)) == YES then
 8:     return
 9:   let $C$ := $x \cap (U \cup V)$
10:   if $x$ is a positive counterexample then
11:     $P$ := $C$ and $u$ := $m$
12:   else
13:     $N$ := $C$ and $\ell$ := $m$
14: ($P$, $p$) := BinarySearch($\emptyset$, $P$)
15: Base := $P \cap N$, $P'$ := $P \setminus$ Base, $N'$ := $N \setminus$ Base
    {Now the key property holds for Base, $N'$ and $P'$}
16: Test := (Base $\cup\ P'$) $\setminus \{p\}$
    {For any $i \in N'$, MQ($x_{Test \cup \{i\}}$) = 1 iff $i$ is relevant}
17: while $|P'| > 1$ do
18:   $i$ := MakeEven($N'$, $P'$, Base)
      {Make $|N'|$ and $|P'|$ be even without spoiling the key property}
19:   if MQ($x_{Test \cup \{i\}}$) == 0 then
20:     $U$ := $U \setminus \{i\}$ and goto Line 2
21:   Let $N_0, N_1$ (resp. $P_0, P_1$) be an equal-sized partition of $N'$ (resp. $P'$)
22:   Ask MQ($x_{Base \cup N_j \cup P_k}$) for $j, k = 0, 1$
23:   Let $j$ and $k$ be indices s.t. MQ($x_{Base \cup N_j \cup P_k}$) = 0 {such $j$ and $k$ exist}
24:   Base := Base $\cup\ P_k$, $P'$ := $P_{1-k}$, $N'$ := $N_j$
25: $U$ := $U \setminus N'$
26: goto Line 2



Procedure FindAdditions (Algorithm 4) finds the new relevant variables;
that is, the elements of $R \cap \bar{U}$, where $\bar{U} = X_n \setminus U$. It stores the uncertain
but potentially relevant variables in the set Potentials (thus Potentials is
initially set to $X_n \setminus U$). The procedure first determines a set Base $\subseteq U$ such that
Base is negative, and Base $\cup$ Potentials is positive (unless Potentials contains
no relevant variables, in which case there are no new relevant variables used
by $\psi$, so we quit). Then the search for the new relevant variables starts. We
use BinarySearch(Base, Base $\cup$ Potentials) to find one relevant variable. This
variable is removed from Potentials, and the process is repeated, again using
BinarySearch. After removing a certain number of relevant variables from
Potentials, the instance Base $\cup$ Potentials must become critical. After reaching
this point, we do not simply remove any newly found relevant variables from
Potentials, but we also add them to the set Base. This way from then on it
holds that $|(Base \cup Potentials) \cap R| = \theta$. Thus the indicator that the last
relevant variable has been removed from Potentials is that Base becomes positive
($\mathrm{MQ}(x_{Base}) = 1$).
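
The behaviour of FindAdditions can be illustrated with a small simulation. The sketch below is a reading of Algorithm 4 under stated assumptions rather than the authors' implementation: instances are sets of variables set to 1, the oracle is simulated from a hypothetical target $\mathrm{TH}_R^\theta$, and the helper `binary_search` is assumed to return a negative set together with one variable whose addition makes it positive (so the "Base := Base \ {x}" adjustment of Line 5 is already folded into its return value).

```python
def make_threshold_mq(relevant, theta):
    """Simulated membership oracle for the threshold function TH_R^theta."""
    relevant = frozenset(relevant)
    return lambda s: int(len(relevant & frozenset(s)) >= theta)


def binary_search(mq, neg, pos):
    """Assumes neg <= pos, mq(neg) == 0, mq(pos) == 1.
    Returns (y_set, y) with mq(y_set) == 0 and mq(y_set | {y}) == 1."""
    neg = frozenset(neg)
    extra = sorted(frozenset(pos) - neg)
    lo, hi = 0, len(extra)   # neg + extra[:lo] is negative, neg + extra[:hi] is positive
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mq(neg | frozenset(extra[:mid])):
            hi = mid
        else:
            lo = mid
    return neg | frozenset(extra[:lo]), extra[lo]


def find_additions(mq, universe, u):
    """Return the relevant variables of the (non-constant) target outside u."""
    universe, u = frozenset(universe), frozenset(u)
    potentials = universe - u
    new_relevants = set()
    if mq(u) == 0:
        base = u
    else:
        base, _ = binary_search(mq, frozenset(), u)   # negative, critical subset of u
    if mq(base | potentials) == 0:
        return frozenset()                            # nothing relevant outside u
    while True:
        _, y = binary_search(mq, base, base | potentials)
        new_relevants.add(y)
        potentials = potentials - {y}
        if mq(base | potentials) == 0:                # Base + Potentials became critical
            base = base | {y}
        if mq(base) == 1:                             # last relevant variable found
            break
    return frozenset(new_relevants)


universe = range(20)
found_a = find_additions(make_threshold_mq({1, 4, 7, 10, 15}, 3), universe, {1, 2, 3, 4})
found_b = find_additions(make_threshold_mq({1, 4, 7}, 2), universe, {1, 4, 5})
found_c = find_additions(make_threshold_mq({1, 4}, 1), universe, {1, 4, 5})
```

Each call to `binary_search` costs a logarithmic number of membership queries, which is where the $\log n$ factor per discovered variable comes from in this sketch.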

Let us assume that we have identified a set of variables that is known to 
contain all the relevant variables of the target function, and possibly some addi- 
tional irrelevant variables. Consider threshold functions with the set of variables 
above, and all possible values of the threshold. Let us perform a sequence of 
equivalence queries with these functions, doing a binary search over the thresh- 
old value (moving down, resp. up, if a positive, resp. negative, counterexample 
is received). In algorithm FindDeletions this is done by the first while loop 
starting at Line 5 and the use of TestExtreme right before it at Line 3 (the 
latter is needed to ensure the full range of search; it performs the test for the two 
extreme cases in the binary search, which the while loop might miss: the con- 
junction and the disjunction of all the variables). In case the relevant variables 
of the target function are exactly those that we currently use, then one can see
that this binary search will find $\psi$. Otherwise (i.e., when some of the currently
used variables are irrelevant in $\psi$) it can be shown that after the above binary
search we always end up with a "large" negative example ($x_N$) and a "small"
positive example ($x_P$); more precisely they satisfy $|P| < |N|$. Using these sets
one easily obtains three sets Base, $N'$ and $P'$ that have the key property.

Key property: Sets Base, $N'$, and $P'$ satisfy the key property if they are
pairwise disjoint, and it holds that Base $\cup\ N'$ is negative, $|(Base \cup P') \cap R| = \theta$,
and $|N'| > |P'|$.

The following claim gives two important features of this definition. 

Claim 1 a) If Base, $N'$ and $P'$ satisfy the key property then $N'$ contains an
irrelevant variable and $P'$ contains a relevant variable.

b) If Base, N' and P' satisfy the key property and \P'\ = 1 then every element 
of N' is irrelevant. 

From now on we maintain these three sets in such a manner that they preserve
the key property, but in each iteration the sizes of $N'$ and $P'$ are halved. For this
we split up $N'$ (respectively $P'$) into two equal-sized disjoint subsets $N_0$ and $N_1$
(resp. $P_0$ and $P_1$). When both $|N'|$ and $|P'|$ are even we can do this without
any problem; otherwise we have to make some adjustments to $N'$ and/or to $P'$,
which are taken care of by procedure MakeEven (the technical details are
omitted due to space limitations). Using the notation $\theta' = \theta - |R \cap Base|$ we
have $|R \cap (N_0 \cup N_1)| < \theta'$ and $|R \cap (P_0 \cup P_1)| = \theta'$. Thus for some $j, k \in \{0, 1\}$
we have $|R \cap (N_j \cup P_k)| < \theta'$ (equivalently $\mathrm{MQ}(x_{Base \cup N_j \cup P_k}) = 0$). Note that the
sets Base := Base $\cup\ P_k$, $N'$ := $N_j$ and $P'$ := $P_{1-k}$ still have the key property,
but the size of $N'$ and $P'$ is reduced by half. Thus after at most $\log n$ steps $P'$ is
reduced to a set consisting of a single (relevant) variable. Thus $N'$ is a nonempty
set of irrelevant variables (part b) of Claim 1). □
We mention some additional results showing that both types of queries are
necessary for efficient revision, that the query bound of algorithm
ReviseThreshold cannot be improved in general, and that Winnow (see [19]) cannot
be used for revising threshold functions (at least in its original form).

Theorem 4. a) Efficient revision is not possible using only membership queries, 
or only equivalence queries. 

b) The query complexity of any revision algorithm for threshold functions is
$\Omega(d \log d)$ queries, where $d$ is the revision distance.

c) Winnow is not an efficient revision algorithm for threshold functions be- 

cause it may make too many mistakes. More precisely, for any weight vector 
representing the initial threshold function Winnow can make n mis- 

takes when the target function is 

Proofs will be given in the full version of this paper. 
Abstract. The paper identifies several new properties of the lattice induced
by the subsumption relation over first-order clauses and derives
implications of these for learnability. In particular, it is shown that the 
length of subsumption chains of function free clauses with bounded size 
can be exponential in the size. This suggests that simple algorithmic 
approaches that rely on repeating minimal subsumption-based refine- 
ments may require a long time to converge. It is also shown that with 
bounded size clauses the subsumption lattice has a large branching fac- 
tor. This is used to show that the class of first-order length-bounded 
monotone clauses is not properly learnable from membership queries 
alone. Finally, the paper studies pairing, a generalization operation that 
takes two clauses and returns a number of possible generalizations. It is 
shown that there are clauses with an exponential number of pairing re- 
sults which are not related to each other by subsumption. This is used to 
show that recent pairing-based algorithms can make exponentially many 
queries on some learning problems. 



1 Introduction 

The field of Inductive Logic Programming (ILP) is concerned with developing 
theory and methods that allow for efficient learning of classes of concepts ex- 
pressed in the language of first-order logic. Subsumption is a generality relation 
over first order clauses that induces a quasi-order on the set of clauses. The sub- 
sumption lattice is of crucial importance since many ILP algorithms perform a 
search over this space and as a result the lattice has been investigated extensively 
in the literature (see survey in [15]). The paper contributes to this study in two 
ways. First, we expose and prove new properties of the subsumption lattice of 
first-order clauses. Second, we use these properties to prove negative learning 
results in the model of exact learning from queries. These results illustrate the 
connection between the subsumption lattice and learning. 

This work arises from the study of query complexity of learning in first order 
logic. Several positive learnability results exist in the model of exact learning 
from queries [1]. However, except for a “monotone-like case” [19] the query com- 
plexity is either exponential in one of the crucial parameters (e.g. the number of 
universally quantified variables) [13,3] or the algorithms use additional syntax- 
based oracles [7,20,18]. It is not clear whether the exponential dependence is 
necessary or not. Previous work in [4] showed that the VC-dimension cannot
resolve this question. The current paper explores how properties of subsumption 
affect this question. 

We start by considering the length of proper subsumption chains
$c_1 \prec c_2 \prec \ldots \prec c_n$ of first order clauses of restricted size. This is motivated
by two issues. First, many ILP algorithms (e.g. [21,17,8]) use refinement of
clauses where in each step the clause is modified using a minimal subsumption
step. Thus the convergence of such approaches hinges on the length of
subsumption chains.
A second motivation comes from the use of certificates [12,11] to study query 
complexity. It is known [12,11] that a class C is learnable from equivalence and 
membership queries if and only if the class C has polynomial certificates. Pre- 
vious work in [6] developed certificates for propositional classes. In particular, 
one of the constructions of certificates for Horn expressions uses the fact that all 
proper subsumption chains of propositional Horn clauses are short. Hence any 
generalization of this construction to first order logic relies on the length of such 
chains. 

Section 3 shows that subsumption chains can be exponentially long (in num- 
ber of literals and variables) even with function free clauses with a bounded 
number of literals. This result suggests that simple algorithmic approaches that 
rely only on minimal refinement steps may require a long time to converge and 
excludes simple generalizations of the certificate construction. We also show that 
if one imposes inequalities on all terms in a clause then subsumption chains are 
short. This further supports the use and study of inequated expressions as done 
e.g. in [13,3,9]. 

The chain length result gives an informal argument against certain ap- 
proaches. Section 4 uses a similar construction to show that the class of length- 
bounded monotone first order clauses is not properly learnable using membership 
queries only. This result is derived by studying the lattice structure of length 
bounded clauses and using it to show that the teaching dimension [2,10] is expo- 
nential in the size. The result follows since the teaching dimension gives a lower 
bound for the number of membership queries required to learn a class [2,10]. 

Finally in Section 5 we address the complexity of the algorithms given in [13, 
3] discussed above. One of the sources of exponential dependence on the number 
of variables is the number of pairings. Intuitively, a pairing is an operation that, 
given two first-order clauses, results in a new clause which is more general than 
the initial ones; two clauses have many pairings and the algorithm enumerates 
these in the process of learning. Results in [13,3] gave an upper-bound on the 
number of pairings, but left it open whether a large number of pairings can 
actually occur in examples. We give an exponential lower bound (in number of 
variables) on the number of pairings and construct an explicit example showing 
that the algorithm can be forced to make an exponential number of queries. 

Due to space limitations several proofs are omitted from the paper; they can 
be found in [5]. 







M. Arias and R. Khardon 



2 Preliminaries 

We assume familiarity with basic concepts in first order logic as described e.g. 
in [14,15]. We briefly review notions relevant to this paper. 

A signature $S$ consists of a finite set of predicates $P$ and a finite set of
functions $F$, both with their associated arities. Constants are functions with arity
0. A countable set of variables $x_1, x_2, x_3, \ldots$ is used to construct expressions. A
variable is a term. If $t_1, \ldots, t_n$ are terms and $f \in F$ is a function symbol of
arity $n$, then $f(t_1, \ldots, t_n)$ is a term. An atom is an expression $p(t_1, \ldots, t_n)$ where
$p \in P$ is a predicate symbol of arity $n$ and $t_1, \ldots, t_n$ are terms. An atom is called a
positive literal. A negative literal is an expression $\neg l$ where $l$ is a positive literal. A
clause is a disjunction of literals where all variables are universally quantified. A
Horn clause has at most one positive literal and an arbitrary number of negative
literals. A Horn clause $\neg p_1 \vee \ldots \vee \neg p_n \vee p_{n+1}$ is equivalent to its implicational form
$p_1 \wedge \ldots \wedge p_n \to p_{n+1}$. We call $p_1 \wedge \ldots \wedge p_n$ the antecedent and $p_{n+1}$ the consequent
of the clause. A meta-clause is a pair of the form $[s, c]$, where both $s$ and $c$ are
sets of atoms such that $s \cap c = \emptyset$; $s$ is the antecedent of the meta-clause and
$c$ is the consequent. Both are interpreted as the conjunction of the atoms they
contain. Therefore, the meta-clause $[s, c]$ is interpreted as the logical expression
$\bigwedge_{b \in c} s \to b$. An ordinary clause $C = s_c \to b_c$ corresponds to the meta-clause
$[s_c, \{b_c\}]$. Fully inequated clauses [3] are clauses whose terms are forced to be
always distinct. That is, any instance of a fully inequated clause is not allowed
to unify any of its terms. This can be done by adding explicit inequalities on all
terms as in: $[x \ne f(x)] \wedge [x \ne a] \wedge [a \ne f(x)] \wedge p(x, f(x)) \wedge p(a, x) \to q(a)$.

We use the symbol $\models$ to denote logical implication, which is defined following
the standard semantics of first-order logic.

We need several parameters to quantify the complexity of first-order
expressions; we use the first-order expression $E = \neg p(x, f(x)) \vee \neg p(a, b) \vee q(b)$ to
illustrate these. NTerms(·): counts the number of distinct terms in the input
expression. Hence, NTerms($E$) = 4, corresponding to the term set $\{x, a, f(x), b\}$.
WTerms(·): similar to NTerms, with the only difference that functional terms
are given twice as much weight as variables. Hence, WTerms($E$) = 7 since terms
in $\{a, f(x), b\}$ contribute 2 each and $x$ contributes 1. NLiterals(·): counts the
number of literals in the input expression. Hence, NLiterals($E$) = 3.
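
The three size measures can be computed mechanically. The following sketch uses an encoding chosen only for illustration (an assumption, not the paper's notation): a variable is a string, a functional term (including a constant, which is a function of arity 0) is a tuple `(symbol, arg1, ...)`, and a literal is a triple `(is_positive, predicate, args)`.

```python
def subterms(term):
    """Yield a term and, recursively, every term nested inside it."""
    yield term
    if isinstance(term, tuple):          # functional term: (symbol, *arguments)
        for arg in term[1:]:
            yield from subterms(arg)


def term_set(clause):
    """Set of distinct terms occurring anywhere in the clause."""
    return {s for (_sign, _pred, args) in clause for t in args for s in subterms(t)}


def n_terms(clause):
    return len(term_set(clause))


def w_terms(clause):
    # Functional terms (constants included) weigh 2, variables weigh 1.
    return sum(2 if isinstance(t, tuple) else 1 for t in term_set(clause))


def n_literals(clause):
    return len(clause)


# E = not p(x, f(x))  or  not p(a, b)  or  q(b)
a, b = ("a",), ("b",)
E = [
    (False, "p", ("x", ("f", "x"))),
    (False, "p", (a, b)),
    (True, "q", (b,)),
]
```

With this encoding the measures of `E` agree with the values 4, 7 and 3 stated in the text.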

Let $C$, $D$ be two arbitrary first-order clauses. We say that a clause $C$ subsumes
a clause $D$, and denote this by $C \preceq D$, if there is a substitution $\theta$ such that
$C \cdot \theta \subseteq D$. Moreover, they are subsume-equivalent, denoted $C \sim D$, if $C \preceq D$
and $D \preceq C$. We say that $C$ strictly or properly subsumes $D$, denoted $C \prec D$,
if $C \preceq D$ but $D \not\preceq C$. The relation $\preceq$ is reflexive and transitive and hence it
induces a quasi-order on the set of clauses.
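
For small function-free clauses the definition can be checked by brute force. The sketch below is only an illustration under simplifying assumptions: a clause is a set of atoms `(predicate, args)` all of the same sign, every argument is a variable name, and the helper names are hypothetical. It tries every mapping from the variables of $C$ to the variables of $D$, which is exponential and meant for toy inputs only.

```python
from itertools import product


def clause_variables(clause):
    return sorted({v for (_pred, args) in clause for v in args})


def apply_substitution(clause, theta):
    return {(pred, tuple(theta.get(v, v) for v in args)) for (pred, args) in clause}


def subsumes(c, d):
    """True iff there is a substitution theta with c.theta a subset of d."""
    c, d = set(c), set(d)
    c_vars, d_vars = clause_variables(c), clause_variables(d)
    for image in product(d_vars, repeat=len(c_vars)):
        if apply_substitution(c, dict(zip(c_vars, image))) <= d:
            return True
    return False


def properly_subsumes(c, d):
    return subsumes(c, d) and not subsumes(d, c)


d1 = {("p", ("z", "z", "z", "z"))}
d2 = {
    ("p", ("x1", "z", "z", "z")),
    ("p", ("z", "x1", "z", "z")),
    ("p", ("z", "z", "x1", "z")),
}
```

Here `d2` properly subsumes `d1` via the substitution mapping `x1` to `z`, while no substitution maps the single atom of `d1` into `d2`.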

3 On the Length of Proper Chains 

In this section we study the length of proper subsumption chains of clauses
$c_1 \prec c_2 \prec \ldots \prec c_n$. It is known that infinite chains exist if one does not restrict
clause size [16,15] but bounds for clauses of restricted size (which are necessarily
finite) were not known before. We show that in the case of fully inequated clauses,
the length of any proper chain is polynomial in the number of literals and the
number of terms in the clauses involved. On the other hand, if clauses are not
fully inequated, then chains of length exponential in the number of variables (or
literals) exist, even if clauses are function free.



3.1 Subsumption Chains for Fully Inequated Clauses Are Short 

We say that a substitution $\theta$ is unifying w.r.t. a clause $c_1$ if there exist two
distinct terms $t, t'$ in $c_1$ that have been syntactically unified, i.e. $t \cdot \theta = t' \cdot \theta$. It
is easy to verify the following lemma:

Lemma 1. Let $c_1, c_2$ be two fully inequated clauses. If $c_1 \preceq c_2$, then (1) it must
be via a non-unifying substitution w.r.t. $c_1$, (2) WTerms($c_1$) $\le$ WTerms($c_2$),
and (3) NLiterals($c_1$) $\le$ NLiterals($c_2$).



Lemma 2. Let $c_1, c_2$ be fully inequated clauses such that $c_1 \prec c_2$. Then, either
NLiterals($c_1$) < NLiterals($c_2$) or WTerms($c_1$) < WTerms($c_2$).

Proof. By the previous lemma we only need to disprove the possibility that both
NLiterals($c_1$) = NLiterals($c_2$) and WTerms($c_1$) = WTerms($c_2$). Suppose so,
and let $\theta$ be the substitution such that $c_1\theta \subseteq c_2$. Then $\theta$ induces a 1-1 mapping
of terms. Now if $\theta$ maps a variable to a non-variable term then WTerms($c_1$) <
WTerms($c_2$). So $\theta$ must be a variable renaming. If $\theta$ is a variable renaming
and NLiterals($c_1$) = NLiterals($c_2$), then $c_1$ and $c_2$ must be syntactic variants,
contradicting the assumption that $c_2 \not\preceq c_1$. □

As a result each step in a strict subsumption chain reduces one of WTerms
or NLiterals, and since these are bounded by $2t$ and $l$ respectively we get:

Theorem 1. The longest proper subsumption chain of fully inequated clauses
with at most $t$ terms and $l$ literals is of length at most $2t + l$.



3.2 Function Free Clauses Have Long Proper Chains 

In this section we demonstrate that function free first-order clauses can produce 
chains of exponential length. We start with a simple construction where the arity 
of predicates is not constant. 

Let $p$ be a predicate symbol of arity $a$. The chain $d_1 \succ d_2 \succ \ldots$ is defined
inductively. The first clause is $d_1 = p(z, \ldots, z)$, and given a clause $d_i = p_1, p_2, \ldots, p_k$,
we define the next clause $d_{i+1}$ as follows: (1) if $p_1$ contains only two occurrences
of the variable $z$, then $d_{i+1} = p_2, \ldots, p_k$, or else (2) if $p_1$ contains $c \ge 3$ occurrences
of the variable $z$, replace the atom $p_1$ by a new set of atoms $p'_1, \ldots, p'_{k'}$ such that
$k' = \min(c, l - k + 1)$, and every new atom $p'_j$ for $1 \le j \le k'$ is a copy of $p_1$
in which the $j$'th occurrence of the variable $z$ has been replaced by a new fresh
variable not appearing in $d_i$ (the same variable for all copies).
Example 1. Suppose $p$ has arity 4 and that $l = 3$. The construction produces
the following chain of length 11:

$p(z,z,z,z)$
$\succ p(x_1,z,z,z),\ p(z,x_1,z,z),\ p(z,z,x_1,z)$
$\succ p(x_1,x_2,z,z),\ p(z,x_1,z,z),\ p(z,z,x_1,z)$
$\succ p(z,x_1,z,z),\ p(z,z,x_1,z)$
$\succ p(x_2,x_1,z,z),\ p(z,x_1,x_2,z),\ p(z,z,x_1,z)$
$\succ p(z,x_1,x_2,z),\ p(z,z,x_1,z)$
$\succ p(z,z,x_1,z)$
$\succ p(x_2,z,x_1,z),\ p(z,x_2,x_1,z),\ p(z,z,x_1,x_2)$
$\succ p(z,x_2,x_1,z),\ p(z,z,x_1,x_2)$
$\succ p(z,z,x_1,x_2)$
$\succ \emptyset$
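
The inductive construction can be replayed mechanically. The sketch below is an illustration under assumptions: atoms are tuples of variable names for the single predicate $p$, a clause is an ordered list of atoms, and the fresh variable is taken to be the lowest-numbered `x_i` not occurring in the current clause (a naming choice that happens to match the example, not something the construction requires). The function names are hypothetical.

```python
def fresh_variable(clause):
    """Lowest-numbered x_i that does not occur in the clause."""
    used = {v for atom in clause for v in atom}
    i = 1
    while f"x{i}" in used:
        i += 1
    return f"x{i}"


def next_clause(clause, max_literals):
    """One step of the construction applied to a non-empty clause."""
    first, rest = clause[0], clause[1:]
    z_positions = [pos for pos, v in enumerate(first) if v == "z"]
    if len(z_positions) == 2:
        return rest                                   # rule (1): drop the first atom
    copies = min(len(z_positions), max_literals - len(clause) + 1)
    x = fresh_variable(clause)
    new_atoms = []
    for j in range(copies):                           # rule (2): j'th z becomes x
        atom = list(first)
        atom[z_positions[j]] = x
        new_atoms.append(tuple(atom))
    return new_atoms + rest


def build_chain(arity, max_literals):
    """Chain starting at p(z,...,z) and ending at the empty clause (arity >= 2)."""
    chain = [[("z",) * arity]]
    while chain[-1]:
        chain.append(next_clause(chain[-1], max_literals))
    return chain


chain = build_chain(arity=4, max_literals=3)
```

For arity 4 and $l = 3$ this reproduces the eleven clauses of the example, ending with the empty clause.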



Let $N(c, s)$ be the number of subsumption generalizations that can be
produced by this method when starting with a singleton clause which is allowed to
expand on $s$ literals (i.e., $l = s + 1$) and whose only atom has $c \ge 2$ occurrences of
the variable $z$. Then, the following relations hold: (A) $N(2, s) = 1$, for all $s \ge 0$.
To see this note that when there are only 2 occurrences of the variable $z$, the
only possible step is to remove the atom, thus obtaining the empty clause. (B)
$N(c, 0) = c - 1$, for all $c \ge 2$. This is derived by observing that when we have
$c \ge 2$ occurrences of the distinguished variable $z$ and no expansion on the number
of literals is possible, we can apply $c - 2$ steps that replace occurrences of $z$ by new
variables, and a final step that drops the literal. After this, no more
generalizations are possible. (C) $N(c, s) = 1 + \sum_{i=\max(0,\,s-c+1)}^{s} N(c-1, i)$, for all $c > 2$,
$s > 0$. This recurrence is obtained by observing that the initial clause containing our
single atom can be replaced by $\min(c, s+1)$ "copies" in a first generalization step,
leaving $q = \max(0, s - c + 1)$ empty slots. After this, each of these copies, which
contain $c - 1$ occurrences of the distinguished variable $z$, goes through the series
of generalizations: the left-most atom has $q$ positions to use for its expansion
and is generalized $N(c-1, q)$ times until it is finally dropped; the next atom has
$q + 1$ positions to expand since the left-most atom has been dropped, and hence
it produces $N(c-1, q+1)$ generalization steps until it is finally dropped, and
so on. The next lemma can be proved by induction on $c$ and $s$.
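
Relations (A), (B) and (C) determine $N(c, s)$ completely, so small values can be tabulated directly. The sketch below is a plain memoized transcription of the three relations; the function name `N` mirrors the text, and everything else is an implementation choice made for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def N(c, s):
    """Number of generalization steps, following relations (A), (B) and (C)."""
    if c < 2 or s < 0:
        raise ValueError("N(c, s) is defined for c >= 2 and s >= 0")
    if c == 2:
        return 1                      # (A): the only step is to drop the atom
    if s == 0:
        return c - 1                  # (B): c - 2 replacements, then one drop
    # (C): one expansion step, then each copy is generalized in turn
    return 1 + sum(N(c - 1, i) for i in range(max(0, s - c + 1), s + 1))


steps = N(4, 2)
```

For arity 4 and $l = 3$ (so $s = 2$) this gives 10 generalization steps, matching the chain of 11 clauses in Example 1.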

Lemma 3. N{c, s) > — 1 for c > 2 and s > 0. 

It remains to show that this is a proper chain. First, we investigate key 
structural properties of the clauses participating in our chain. The following 
three lemmas can be proved by induction on the updates of di. 

Lemma 4. Let Vars($p$) be the variables occurring in the atom $p$. For all
$d_i = p_1, \ldots, p_k$ the following properties hold: (1) Every atom $p_j \in d_i$ contains
no repeated occurrences of variables, with the exception of $z$, which appears at
least twice in each atom. (2) Vars($p_j$) $\supseteq$ Vars($p_{j+1}$) for all $j = 1, \ldots, k - 1$.

From the properties stated in the previous lemma, it follows that we can
view any clause $d_i$ as a sequence of blocks of atoms $B_1, B_2, \ldots, B_m$ such that all
the atoms in a single block contain exactly the same variables, and variables
appearing in neighboring blocks are such that Vars($B_j$) $\supset$ Vars($B_{j+1}$).

Lemma 5. Fix some clause $d_i$, and let $p$ be an atom in any block $B$. If $p \cdot \theta \in B$,
then $\theta$ does not change variables in $p$.

Lemma 6. Let $d_i$ be any clause in the sequence and let $B_1, \ldots, B_m$ be its blocks.
Then, for any pair of blocks $B_{i_1}$ and $B_{i_2}$ s.t. $i_1 < i_2$, there exists some variable
in Vars($B_{i_1}$) $\setminus \{z\}$ that is in the same position $j$ in all the atoms in $B_{i_1}$ but in
all the atoms in $B_{i_2}$ it appears in different positions, always different from the
one in $B_{i_1}$. Moreover, all the atoms in $B_{i_2}$ contain the variable $z$ at position $j$.

Lemma 7. Fix some clause $d_i = p_1, \ldots, p_k$ with at least 2 atoms (i.e., $k \ge 2$).
Then $(p_2, \ldots, p_k) \cdot \theta \subseteq d_i$ only if $\theta$ does not change variables in $p_2$.

Proof. Let $d_i = B_1, \ldots, B_m$. Let $p$ be any atom in any block $B_j$ and assume
$p \cdot \theta \in d_i$. Notice that $p \cdot \theta \notin B_1, \ldots, B_{j-1}$ since atoms in blocks $B_1, \ldots, B_{j-1}$ contain
strictly more variables than $p \cdot \theta$. Hence $p \cdot \theta \in B_j, \ldots, B_m$. We first claim that if
$\theta$ does not change variables in $B_{j+1}, \ldots, B_m$ then $\theta$ does not change variables in
$B_j$. To prove the claim note that if $p \cdot \theta \in B_j$ then Lemma 5 applies. On the
other hand if $p \cdot \theta \in B_{j'}$ for $j < j'$ then Lemma 6 guarantees that there exists
some variable $x \in$ Vars($B_{j'}$) that appears in a position in $p$ in which atoms in
$B_{j'}$ contain the variable $z$. Since by assumption $\theta$ does not change this variable
this implies that $p \cdot \theta \notin B_{j'}$, leading to a contradiction.

Finally if $(p_2, \ldots, p_k) \cdot \theta \subseteq d_i$ then we have an atom $p$ as above from each
block. Therefore we can apply the claim inductively starting with $j = m$ and
until $j = i_2$ where $i_2$ is the block index of $p_2$. This implies that $\theta$ does not
change variables that appear in the leftmost block of $p_2, \ldots, p_k$, and hence in $p_2$
as required. □



Lemma 8. For all z = 1, ..,n — 1 we have that di >- di+\. 

Proof. Suppose that $d_i = p_1, .., p_k$. We have the following possible transitions
from $d_i$ to $d_{i+1}$:

Case 1. $d_{i+1} = p_2, .., p_k$. Clearly, $d_i \supseteq d_{i+1}$, and hence $d_i \succeq d_{i+1}$ via the
empty substitution. Suppose by way of contradiction that $d_i \preceq d_{i+1}$, so there
must be a substitution $\theta$ s.t. $d_i \cdot \theta \subseteq d_{i+1}$. Clearly, $i + 1 \ne n$ since otherwise
we could not satisfy $d_i \cdot \theta \subseteq d_{i+1} = \emptyset$. Therefore, $d_{i+1} \ne \emptyset$ and $d_i$ contains at
least 2 atoms. The fact $d_i \cdot \theta \subseteq d_{i+1}$ implies that $(p_2, .., p_k) \cdot \theta \subseteq d_i$, and by
Lemma 7, $\theta$ must not change variables in $p_2$. If $p_1$ and $p_2$ are in the same block,
then $p_1 \cdot \theta = p_1 \notin d_{i+1}$. If $p_1$ and $p_2$ are in different blocks, then Lemma 6
guarantees that for every atom in $p_2, .., p_k$ there is a variable that appears in
a different location in $p_1$ and as above this variable cannot be changed by $\theta$.
Hence, $p_1 \cdot \theta \notin d_{i+1}$, contradicting our assumption that $d_i \preceq d_{i+1}$.

Case 2. $d_{i+1} = p'_1, .., p'_r, p_2, .., p_k$. Let $x$ be the newly introduced variable.
Then, $d_{i+1} \cdot \{x \mapsto z\} \subseteq d_i$ and hence $d_i \succeq d_{i+1}$. To see that $d_i \not\preceq d_{i+1}$, suppose
that this is not the case. Hence, there must be a substitution $\theta$ such that $d_i \cdot \theta \subseteq
d_{i+1}$. If $d_i = p_1$ (i.e., $d_i$ contains one atom only), then $p_1 \cdot \theta \subseteq p'_1, .., p'_r$. In this
case, $\theta$ must map $z$ into the new variable $x$ but this results in multiple occurrences
of $x$, and hence $p_1 \cdot \theta \not\subseteq p'_1, .., p'_r$. Hence, $d_i$ must contain at least two atoms
and the substitution $\theta$ must satisfy that $(p_1, .., p_k) \cdot \theta \subseteq p'_1, .., p'_r, p_2, .., p_k$. The
new atoms $p'_1, .., p'_r$ contain more variables than $p_1, .., p_k$, therefore $(p_1, .., p_k) \cdot
\theta \subseteq p_2, .., p_k$. By the same reasoning as in the previous case, we conclude that
$d_i \succ d_{i+1}$. □



Now Lemma 3 and Lemma 8 imply: 

Theorem 2. Let $p$ be a predicate symbol of arity $a \ge 1$. There exists a proper
subsumption chain of length $\binom{a}{l}$ of function free clauses using at most $a$ variables
and $l$ literals.

The construction above can be improved to use predicates of arity 3 as follows. 

Definition 1. Let $d$ be any clause. Let $Trans(d)$ be the clause obtained by
replacing each literal $p(t_1, .., t_a)$ with a new set $\{p(y_i, y_{i+1}, t_i) \mid 1 \le i \le a\}$, where
all $y_1, .., y_{a+1}$ are new variables not appearing in $d$. The new variables $y_1, .., y_{a+1}$
are different for each atom in $d$.



Example 2. The clause $p(z, x_1, x_2, z), p(z, z, x_1, z)$ is transformed into the clause
$p(y_1, y_2, z)$, $p(y_2, y_3, x_1)$, $p(y_3, y_4, x_2)$, $p(y_4, y_5, z)$, $p(y'_1, y'_2, z)$, $p(y'_2, y'_3, z)$,
$p(y'_3, y'_4, x_1)$, $p(y'_4, y'_5, z)$.
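
The transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: clauses are lists of `(predicate, args)` pairs, the helper name `trans` is ours, and the generated names `y1, y2, ..` are assumed not to clash with variables already in the clause.

```python
from itertools import count

def trans(clause):
    """Replace each literal p(t_1..t_a) by the literals p(y_i, y_{i+1}, t_i), 1 <= i <= a."""
    fresh = count(1)  # one shared counter, so auxiliary variables differ for each atom
    out = []
    for pred, args in clause:
        a = len(args)
        ys = [f"y{next(fresh)}" for _ in range(a + 1)]  # a + 1 new variables per atom
        for i, t in enumerate(args):
            out.append((pred, (ys[i], ys[i + 1], t)))
    return out

# The clause of Example 2: p(z, x1, x2, z), p(z, z, x1, z)
d = [("p", ("z", "x1", "x2", "z")), ("p", ("z", "z", "x1", "z"))]
td = trans(d)
variables = {arg for _, args in td for arg in args}
```

Here the second atom receives `y6, .., y10` rather than primed names, which is the same clause up to renaming.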

Consider a function free clause $d$ with predicate symbols of arity at most $a$,
containing $v$ variables and $l$ literals. Then, $Trans(d)$ uses predicates of arity 3,
has $l(a+1) + v$ variables and $al$ literals. The next lemma gives the main property
of this transformation:

Lemma 9. Let $d_1, d_2$ be clauses. Then, $d_1 \preceq d_2$ iff $Trans(d_1) \preceq Trans(d_2)$.

Proof. Assume first that $d_1 \preceq d_2$, i.e., there is a substitution $\theta$ from variables in
$d_1$ into terms of $d_2$ such that $d_1 \cdot \theta \subseteq d_2$. Obviously, $\theta$ does not alter the value of
the new variables added to $Trans(d_1)$, and hence $Trans(d_1) \cdot \theta = Trans(d_1 \cdot \theta) \subseteq
Trans(d_2)$, so that $Trans(d_1) \preceq Trans(d_2)$.

For the other direction, assume that there exists a substitution $\theta$ such that
$Trans(d_1) \cdot \theta \subseteq Trans(d_2)$. Let $d_1 = l_1^1 \vee l_1^2 \vee .. \vee l_1^{k_1}$ and let
$\{y_1^j, .., y^j_{arity(l_1^j)+1}\}$ be the variables used in the transformation for literal $l_1^j$
in $d_1$, for $1 \le j \le k_1$. Similarly, let $d_2 = l_2^1 \vee l_2^2 \vee .. \vee l_2^{k_2}$ and let
$\{y'^j_1, .., y'^j_{arity(l_2^j)+1}\}$ be the variables used in the transformation for literal $l_2^j$
in $d_2$, for $1 \le j \le k_2$. First we show that $\theta$ must map blocks of auxiliary
variables in $Trans(d_1)$, $\{y_1^j, .., y^j_{arity(l_1^j)+1}\}$, into blocks of auxiliary variables
in $Trans(d_2)$, $\{y'^{j'}_1, .., y'^{j'}_{arity(l_2^{j'})+1}\}$, so that the predicate symbol of $l_1^j$
coincides with the predicate symbol of $l_2^{j'}$. Moreover, the order of the
variables is preserved, i.e., $\theta$ maps each $y_i^j \mapsto y'^{j'}_i$, for all $1 \le i \le arity(l_1^j)$. By
way of contradiction, suppose that there exists a pair of variables in $Trans(d_1)$,
$y_i^j$ and $y_{i+1}^j$, that have been mapped into $y'^a_c$ and $y'^b_d$, respectively, where $a \ne b$.
Then, $p(y_i^j, y_{i+1}^j, *) \cdot \theta = p(y'^a_c, y'^b_d, *) \in Trans(d_2)$. This contradicts the fact
that, by construction, all literals in $Trans(d_2)$ are such that the superscripts of
the first two auxiliary variables coincide.



Suppose now that some $y_i^j$ has been mapped into $y'^{j'}_{i'}$, where $i \ne i'$ and $i$ is
the smallest such index. Assume also that the predicate symbol corresponding
to literal $l_1^j$ is $p$. If $i > 1$, then $p(y_{i-1}^j, y_i^j, *) \cdot \theta = p(y'^{j'}_{i-1}, y'^{j'}_{i'}, *) \in Trans(d_2)$.
But this is a contradiction since all literals in $Trans(d_2)$ are such that their two
initial arguments have the form $p(y'_k, y'_{k+1}, *)$ and here $(i - 1) + 1 \ne i'$. If $i = 1$,
then since $i' > 1$ there must be an index $h$ s.t. $\theta$ maps $y_h^j \mapsto y'^{j'}_{h'}$ but $\theta$ does
not map $y_{h+1}^j \mapsto y'^{j'}_{h'+1}$. Thus we arrive at the same contradiction as in the
previous case.

Now, the fact that each $y_i^j \mapsto y'^{j'}_i$ implies that $\theta$ maps arguments of literals
in $d_1$ into arguments in the same position of literals in $d_2$. Moreover, since blocks
of variables are not mixed, all arguments from a literal in $d_1$ are mapped into
all the arguments of a fixed literal in $d_2$, so we conclude that $d_1 \cdot \theta \subseteq d_2$. □



Theorem 3. If there is a predicate symbol of arity at least 3, then there exist
proper subsumption chains of length at least $2^{\sqrt{v}/2}$ of function free clauses using
at most $v$ variables and $v/2$ literals, where $v \ge 9$.

Proof. Theorem 2 shows that there exists a chain of length $\binom{a}{l} = \binom{\sqrt{v}}{\sqrt{v}/2} \ge
2^{\sqrt{v}/2}$ if we use predicate symbols of arity $\sqrt{v}$, $\sqrt{v}$ variables and $\sqrt{v}/2$ atoms per
clause. Consider the chain $Trans(d_1) \succ Trans(d_2) \succ \cdots \succ Trans(d_n)$. Lemma 9
guarantees that this is also a proper chain. The chain has clauses with
$\frac{\sqrt{v}}{2}(\sqrt{v} + 1) + \sqrt{v} = \frac{v}{2} + \frac{3\sqrt{v}}{2} \le v$ variables (here we use $v \ge 9$) and
$\sqrt{v} \cdot \frac{\sqrt{v}}{2} = \frac{v}{2}$ literals. □
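
The counting in this proof can be checked numerically. The sketch below assumes that the chain length given by Theorem 2 is the binomial coefficient of arity over atoms per clause, and it restricts $v$ to perfect squares with an even root so that all quantities are integers; the function name `chain_parameters` is ours.

```python
from math import comb, isqrt

def chain_parameters(v):
    """Chain length, variables and literals of the transformed chain for a given v."""
    r = isqrt(v)
    if r * r != v or r % 2:
        raise ValueError("v must be a perfect square with an even root")
    a, l = r, r // 2             # arity sqrt(v) and sqrt(v)/2 atoms per clause
    length = comb(a, l)          # assumed chain length from Theorem 2
    variables = l * (a + 1) + r  # l(a+1) auxiliary variables plus sqrt(v) original ones
    literals = a * l             # each literal is replaced by a literals of arity 3
    return length, variables, literals

for v in (16, 36, 64, 100):
    length, variables, literals = chain_parameters(v)
    assert length >= 2 ** (isqrt(v) // 2)   # length is at least 2^(sqrt(v)/2)
    assert variables <= v                   # holds because v >= 9
    assert literals == v // 2
```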



4 Learning from Membership Queries Only 

The previous result suggests that simple use of minimal refinement steps may
require a long time to converge. We next use a related construction to show that
there can be no polynomial algorithm that properly learns the class of monotone 
function-free and length-bounded clauses from membership queries only. We use 
a combinatorial notion, the teaching dimension [2,10], that is known to be a 
lower bound for the complexity of exact learning from membership queries only. 



Definition 2. The teaching dimension of a class $\mathcal{T}$ is the minimum integer $d$
such that for each expression $f \in \mathcal{T}$ there is a set $T$ of at most $d$ examples (the
teaching set) with the property that any expression $g \in \mathcal{T}$ different from $f$ is not
consistent with $f$ over the examples in $T$.
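
For a small finite class the definition can be evaluated by brute force. The following sketch is an illustration only: concepts are modelled as frozensets of the examples they accept, and the function names are ours.

```python
from itertools import combinations

def teaching_set_size(f, concepts, examples):
    """Size of a smallest set T on which every other concept disagrees with f."""
    others = [g for g in concepts if g != f]
    for d in range(len(examples) + 1):
        for T in combinations(examples, d):
            # g is rejected if some example in T is classified differently by f and g
            if all(any((e in f) != (e in g) for e in T) for g in others):
                return d
    return None  # only if two concepts agree on every example

def teaching_dimension(concepts, examples):
    """Worst case over all targets f of the smallest teaching set size."""
    return max(teaching_set_size(f, concepts, examples) for f in concepts)

# Toy class: the empty concept and the three singletons over {0, 1, 2}
examples = [0, 1, 2]
concepts = [frozenset(), frozenset({0}), frozenset({1}), frozenset({2})]
td = teaching_dimension(concepts, examples)
```

In this toy class a singleton is taught by its single positive example, while the empty concept needs all three examples, one to reject each singleton.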

Let $k$ be such that $\log_2 k$ is an integer. Then $\langle t_1, .., t_k \rangle$ denotes the term
represented by a complete binary tree of applications of a binary function symbol $f$
of depth $\log k$ with leaves $t_1, .., t_k$. For example, $\langle 1, 2, 3, 4, 5, 6, 7, 8 \rangle$ represents the
term $f(f(f(1, 2), f(3, 4)), f(f(5, 6), f(7, 8)))$. Notice that the number of distinct
terms in $\langle t_1, .., t_k \rangle$ is at most $k + \sum_{i=1}^{k} NTerms(t_i)$. In particular, if each $t_i$ is
either a variable or a constant, then $NTerms(\langle t_1, .., t_k \rangle) \le 2k$.
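
A minimal sketch of this notation, assuming terms are represented as nested tuples `("f", left, right)` with atomic leaves; `tree_term`, `show` and `subterms` are illustrative names.

```python
def tree_term(leaves):
    """Build <t_1,..,t_k> as a complete binary tree of applications of f."""
    k = len(leaves)
    if k == 0 or k & (k - 1):
        raise ValueError("k must be a power of two")
    if k == 1:
        return leaves[0]
    mid = k // 2
    return ("f", tree_term(leaves[:mid]), tree_term(leaves[mid:]))

def show(t):
    """Render a term in the usual f(..,..) syntax."""
    if isinstance(t, tuple):
        return f"f({show(t[1])},{show(t[2])})"
    return str(t)

def subterms(t, acc=None):
    """Collect the set of distinct subterms of t."""
    acc = set() if acc is None else acc
    acc.add(t)
    if isinstance(t, tuple):
        subterms(t[1], acc)
        subterms(t[2], acc)
    return acc

term = tree_term(list(range(1, 9)))
```

With eight distinct leaves the term has 8 + 7 = 15 distinct subterms, within the bound $2k = 16$.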

Let $p$ be a unary predicate symbol. Consider the clause $p(\langle a, .., a \rangle)$, where the
constant $a$ occurs $k$ times. We consider all the possible minimal generalizations
of $p(\langle a, .., a \rangle)$. That is, clauses $C$ that are strict generalizations of $p(\langle a, .., a \rangle)$ for
which no other clause $C'$ is such that $C \prec C' \prec p(\langle a, .., a \rangle)$. Among them we
find the clauses

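$C_k = p(\langle x, .., x \rangle)$

$C_{k-1} = p(\langle a, x, .., x \rangle) \vee p(\langle x, a, x, .., x \rangle) \vee .. \vee p(\langle x, .., x, a \rangle)$

$C_{k-2} = p(\langle a, a, x, .., x \rangle) \vee p(\langle a, x, a, x, .., x \rangle) \vee .. \vee p(\langle x, .., x, a, a \rangle)$

..

$C_{k/2} = p(\langle a, .., a, x, .., x \rangle) \vee .. \vee p(\langle x, .., x, a, .., a \rangle)$

..

$C_1 = p(\langle a, .., a, x \rangle) \vee p(\langle a, .., a, x, a \rangle) \vee .. \vee p(\langle x, a, .., a \rangle)$

where each $C_i$ includes all possibilities of replacing $i$ positions with a variable.
Clearly, $|C_i| = \binom{k}{i}$. In particular, $|C_{k/2}| = \binom{k}{k/2} \ge 2^{k/2}$.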
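
The sizes of these clauses can be confirmed by enumeration. In the sketch below each atom is represented only by its tuple of leaves, which is an assumption made for brevity; the function name `generalization` is ours.

```python
from itertools import combinations
from math import comb

def generalization(k, i):
    """Atoms of C_i: every way of replacing i of the k occurrences of a by x."""
    atoms = set()
    for positions in combinations(range(k), i):
        atoms.add(tuple("x" if j in positions else "a" for j in range(k)))
    return atoms

k = 8
sizes = {i: len(generalization(k, i)) for i in range(1, k + 1)}
```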

We next define the learning problem for which we find an exponential lower
bound. The signature $S$ consists of the function symbol $f$ of arity 2, two constants
$a, b$, and a single predicate symbol $p$ of arity 1. Fix $l$ to be some integer. Let
the (representation) concept class be $\mathcal{C}$ = {first-order monotone $S$-clauses with
at most $l$ atoms} and the set of examples be $\mathcal{E}$ = {first-order ground monotone
$S$-clauses with at most $l$ atoms}.

We identify the representation concept class $\mathcal{C}$ with its denotations in the
following way. The concept represented by $C \in \mathcal{C}$ is $\{E \in \mathcal{E} \mid C \models E\}$ which
in this case coincides with $\{E \in \mathcal{E} \mid C \preceq E\}$. Thus, this problem is cast in the
framework of learning from entailment.

Suppose that the target concept is / = p((a, ..,a)) and that I < We 

want to find a minimal teaching set $T$ for $f$. The cardinality of a minimal teaching
set for $f$ is clearly a lower bound on the teaching dimension of $\mathcal{C}$. By definition,
the examples in $T$ have to eliminate every other expression in $\mathcal{C}$. In other words,
for every expression $g$ in $\mathcal{C}$ other than $f$, $T$ must include an example $E$ such that
$f \models E$ and $g \not\models E$, or vice versa.



We first observe that the clause Ck /2 is not included in our concept class 
C because it contains too many literals: I < = '—EBl < | (7^/2 1. However, 
subsets of $C_{k/2}$ with exactly $l$ atoms are included in $\mathcal{C}$ because they are monotone
$S$-clauses of at most $l$ literals. Note also that each clause includes $\le 2kl$ terms.
There are $K = \binom{|C_{k/2}|}{l}$ such subsets where we use an additional
restriction that $l \le k$. Let these be $C^1_{k/2}, .., C^K_{k/2}$. By definition, the teaching set
$T$ has to reject each one of these $K$ clauses.

Notice that $C^j_{k/2} \preceq f = p(\langle a, .., a \rangle)$ for each $j = 1, .., K$ (consider the witnessing
substitution $\{x \mapsto a\}$). Now, to reject an arbitrary $C^j_{k/2}$, $T$ must include
some example $E \in \mathcal{E}$ s.t. $C^j_{k/2} \preceq E$ but $p(\langle a, .., a \rangle) \not\preceq E$. The only example in
$\mathcal{E}$ that qualifies is $E^j = C^j_{k/2} \cdot \{x \mapsto b\}$. Hence, for each $C^j_{k/2}$ the example $E^j$
must be included in $T$ and these examples are distinct. Hence, $T$ must contain
all the examples in E'^ , ,.,E^ . Substituting k = and I < ^ so that 2kl < t 
we obtain: 

Theorem 4. Let $\mathcal{C}$ be the class of monotone clauses built from a signature containing
2 constants, a binary function symbol and a unary predicate symbol with
at most I < ^ literals and t terms per clause. Then, the teaching dimension of 
C is 



5 On the Number of Pairings 



Plotkin [16] (see also [15]) defined the least general generalization (lgg) of clauses
w.r.t. subsumption and gave an algorithm to compute it. The lgg of $C_1, C_2$ is a
clause $C$ that subsumes both clauses, namely $C \preceq C_1$, $C \preceq C_2$, and is the least
such clause, that is $D \preceq C$ for any $D$ that subsumes both clauses. The algorithm
essentially takes a cross product of atoms with the same predicate symbol in the
two clauses and generalizes arguments bottom up. Generalization of arguments is
defined as follows. The lgg of two terms $f(s_1, ..., s_n)$ and $g(t_1, ..., t_m)$ is the term
$f(lgg(s_1, t_1), ..., lgg(s_n, t_n))$ if $f = g$ and $n = m$. Otherwise, it is a new variable
$x$, where $x$ stands for the lgg of that pair of terms throughout the computation
of the lgg. This information is kept in what we call the lgg table.



Example 3. Let $C_1 = \{p(a, f(b)), p(g(a,x), c), q(a)\}$ and $C_2 = \{p(z, f(2)), q(z)\}$.
Their pairs of compatible literals are $\{p(a, f(b)) - p(z, f(2)),\ p(g(a,x), c) -
p(z, f(2)),\ q(a) - q(z)\}$. Their lgg is $lgg(C_1, C_2) = \{p(X, f(Y)), p(Z, V), q(X)\}$.
The lgg table produced during the computation of $lgg(C_1, C_2)$ is

[ a - z => X ]            (from p(a, f(b)) with p(z, f(2)))
[ b - 2 => Y ]            (from p(a, f(b)) with p(z, f(2)))
[ f(b) - f(2) => f(Y) ]   (from p(a, f(b)) with p(z, f(2)))
[ g(a,x) - z => Z ]       (from p(g(a,x), c) with p(z, f(2)))
[ c - f(2) => V ]         (from p(g(a,x), c) with p(z, f(2)))
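
The computation of Example 3 can be reproduced with a short sketch. The representation is an assumption: atomic terms are strings, compound terms are tuples `(functor, arg, ..)`, and fresh variables are named `V1, V2, ..` instead of X, Y, Z, V. The class and function names are ours, not Plotkin's.

```python
class LggTable:
    """Maps each pair of terms met during the computation to its generalization."""

    def __init__(self):
        self.entries = {}  # (s, t) -> generalization, kept in insertion order
        self.fresh = 0

    def lgg(self, s, t):
        if (s, t) in self.entries:
            return self.entries[(s, t)]
        if (isinstance(s, tuple) and isinstance(t, tuple)
                and s[0] == t[0] and len(s) == len(t)):
            # same function symbol and arity: generalize argument by argument
            g = (s[0],) + tuple(self.lgg(a, b) for a, b in zip(s[1:], t[1:]))
        elif s == t:
            g = s  # identical atomic terms generalize to themselves
        else:
            self.fresh += 1
            g = f"V{self.fresh}"  # a new variable standing for this pair throughout
        self.entries[(s, t)] = g
        return g

def lgg_clause(c1, c2):
    """Cross product of compatible literals, sharing one lgg table."""
    table = LggTable()
    out = []
    for p, args1 in c1:
        for q, args2 in c2:
            if p == q and len(args1) == len(args2):
                lit = (p, tuple(table.lgg(a, b) for a, b in zip(args1, args2)))
                if lit not in out:
                    out.append(lit)
    return out, table

C1 = [("p", ("a", ("f", "b"))), ("p", (("g", "a", "x"), "c")), ("q", ("a",))]
C2 = [("p", ("z", ("f", "2"))), ("q", ("z",))]
literals, table = lgg_clause(C1, C2)
```

The result corresponds to $\{p(X, f(Y)), p(Z, V), q(X)\}$ under the renaming X = V1, Y = V2, Z = V3, V = V4, and `table.entries` holds the five rows of the lgg table in the same order.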



The number of literals in the lgg of two clauses can be as large as the product
of the number of literals in the two clauses and repeated application of lgg can
lead to an exponential increase in size. Pairings are subsets of the lgg that avoid
this explosion in size by imposing an additional constraint requiring that each
literal in the original clauses is paired at most once with a compatible literal of
the other clause. In Example 3, we have the literal $p(z, f(2)) \in C_2$ paired to the
literals $p(a, f(b))$ and $p(g(a,x), c)$ of $C_1$. A pairing disallows this by including
just one copy in the result. Naturally, given two clauses we now have many
possible pairings instead of a single lgg.

Pairings are defined in [13,3] by way of matchings of terms. Notice that the
first two columns of the lgg table define a matching between terms in the two
clauses. In our example this matching is not 1-1 since the term $f(2)$ in $C_2$ has
been used in more than one entry of the matching, in particular, in entries
$f(b) - f(2)$ and $c - f(2)$. This reflects the fact that the atom $p(z, f(2))$ of $C_2$ is
paired with two atoms in $C_1$ in the lgg. Every 1-1 matching corresponding to a
1-1 restriction of the lgg table induces a pairing.



5.1 General Clauses 

We first show that general clauses allowing the use of arbitrary terms can have
an exponential number of pairings. Fix $v$ such that $\log_2 v$ is an integer. Let $t_{i,j}$
be a ground term that is unique for every pair of integers $0 \le i, j \le v - 1$. For
example, $t_{i,j}$ could use two unary function symbols $f_0$ and $f_1$ and a constant $a$
and we define $t_{i,j}$ as a string of applications of $f_0$ or $f_1$ of length $2 \log v$, finalized
with the constant $a$ such that the first $\log v$ function symbols encode the binary
representation of $i$ and the last $\log v$ function symbols encode $j$. For example,
if $v = 8$, then the term $t_{5,3}$ can be encoded as $f_1(f_0(f_1(f_0(f_1(f_1(a))))))$, where
the first three symbols encode 5 and the last three encode 3. The
size of such a term (in terms of symbol occurrences) is exactly $2 \log v + 1$. Let
$x_0, .., x_{v-1}$ and $y_0, .., y_{v-1}$ be variables. We define

$$C_1 = \bigvee_{0 \le i,j < v,\ 0 \le l < v-1} p(t_{i,j}, x_l, x_{l+1})$$

$$C_2 = \bigvee_{0 \le i,j < v} p(t_{i,j}, y_i, y_j)$$

Notice that $|C_1| = v^2(v - 1)$ and $|C_2| = v^2$ and they use a single predicate
symbol of arity 3.
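
The encoding and the two clauses can be generated mechanically. In this sketch terms are plain strings and literals are triples of arguments of the single predicate $p$; the names `t` and `build_clauses` are ours, and $v$ is assumed to be a power of two with $v \ge 2$.

```python
def t(i, j, v):
    """Ground term t_{i,j}: log v symbols encoding i, then log v symbols encoding j."""
    bits = v.bit_length() - 1  # log2(v) for v a power of two
    code = format(i, f"0{bits}b") + format(j, f"0{bits}b")
    term = "a"
    for b in reversed(code):   # wrap from the innermost symbol outwards
        term = f"f{b}({term})"
    return term

def build_clauses(v):
    """Argument triples of the literals of C_1 and C_2."""
    c1 = {(t(i, j, v), f"x{l}", f"x{l + 1}")
          for i in range(v) for j in range(v) for l in range(v - 1)}
    c2 = {(t(i, j, v), f"y{i}", f"y{j}")
          for i in range(v) for j in range(v)}
    return c1, c2

c1, c2 = build_clauses(4)
```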

Any 1-1 matching between the variables in $C_1$ and $C_2$ can be represented by
a permutation $\pi$ of $\{0, .., v - 1\}$: each variable $x_i$ in $C_1$ is matched to $y_{\pi(i)}$ in $C_2$.
All the matchings considered in this section map the common ground terms of
$C_1$ and $C_2$ to one another, i.e., the extended matchings also contain all entries
[t - t => t], where $t$ is any ground term appearing in both $C_1$ and $C_2$. Let the
extended matching induced by permutation $\pi$ be

$$\{x_i - y_{\pi(i)} \Rightarrow X_{\pi(i)} \mid 0 \le i \le v - 1\} \cup \{t - t \Rightarrow t \mid t \in Terms(C_1) \cap Terms(C_2)\}.$$

First we study $lgg_\pi(C_1, C_2)$, the pairing induced by the 1-1 matching
represented by $\pi$. A literal $p(t_{i,j}, X_a, X_b)$ is included in $lgg_\pi(C_1, C_2)$ iff $a = \pi(l)$
and $b = \pi(l + 1)$ for some $l \in \{0, .., v - 2\}$ (this is the condition imposed
by $C_1$), and $i = a$, $j = b$ (this is the condition imposed by $C_2$). Therefore,

$$lgg_\pi(C_1, C_2) = \bigvee_{0 \le l < v-1} p(t_{\pi(l),\pi(l+1)}, X_{\pi(l)}, X_{\pi(l+1)}).$$

Finally we see that different permutations yield pairings that are subsumption
inequivalent, i.e., $lgg_\pi(C_1, C_2) \not\preceq lgg_{\pi'}(C_1, C_2)$ for any $\pi \ne \pi'$. It is sufficient
to observe that since $\pi$ and $\pi'$ are distinct, there must exist some term $t_{\pi(l),\pi(l+1)}$
in $lgg_\pi(C_1, C_2)$ that is not present in $lgg_{\pi'}(C_1, C_2)$. This holds since a distinct
pair of consecutive indices exists for any two permutations. Since the terms
are ground, subsumption is not possible. There are $v!$ distinct permutations of
$\{0, .., v - 1\}$ and therefore:

Theorem 5. Let $S$ be a signature containing a predicate symbol of arity at least
3, two unary function symbols and a constant. The number of distinct pairings
between a pair of $S$-clauses using $v$ variables, $O(v^3)$ literals and terms of size
$O(\log v)$ can be $\Omega(v!)$.
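
For a small value such as $v = 4$ the count of pairings can be verified exhaustively. The sketch below is a simplification made for illustration: the ground term $t_{i,j}$ is represented by the pair `(i, j)`, variables $x_l$ and $y_i$ by their indices, and a literal pair is kept only when all three argument pairs belong to the extended matching induced by the permutation.

```python
from itertools import permutations

def pairing(pi):
    """Literals of the pairing induced by the 1-1 matching x_l - y_{pi(l)}."""
    v = len(pi)
    c1 = [((i, j), l, l + 1) for i in range(v) for j in range(v) for l in range(v - 1)]
    c2 = [((i, j), i, j) for i in range(v) for j in range(v)]
    out = set()
    for t1, xa, xb in c1:
        for t2, ya, yb in c2:
            # ground terms must coincide and both variable pairs must be matched by pi
            if t1 == t2 and pi[xa] == ya and pi[xb] == yb:
                out.add((t1, f"X{ya}", f"X{yb}"))
    return frozenset(out)

v = 4
pairings = {pairing(pi) for pi in permutations(range(v))}
ground_terms = {frozenset(t for t, _, _ in p) for p in pairings}
```

Each pairing has $v - 1$ literals, and the sets of ground terms are pairwise different, which is the property used above to rule out subsumption between distinct pairings.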

5.2 Function Free Clauses 

We next generalize the construction to use function free clauses. We start with
a construction using non-fixed arity. Our construction mimics the behavior of
pairing ground terms in the previous section by using 2 additional variables, $z_0$
and $z_1$, that encode the integers $i$ and $j$ in a similar way to $t_{i,j}$. By looking at
matchings $\pi$ that match the variables $z_0$ and $z_1$ to themselves, we guarantee that
the resulting $lgg_\pi$ contains the correct encoding of the variables in the last and
previous-to-last positions of the atoms. Let

$$C_1 = \bigvee p(z_{i_1}, .., z_{i_{\log v}}, z_{j_1}, .., z_{j_{\log v}}, x_l, x_{l+1})$$

where the disjunction ranges over $(i_1, .., i_{\log v}), (j_1, .., j_{\log v}) \in \{0,1\}^{\log v}$ and $0 \le l < v - 1$, and

$$C_2 = \bigvee p(z_{i_1}, .., z_{i_{\log v}}, z_{j_1}, .., z_{j_{\log v}}, y_i, y_j)$$

where the disjunction ranges over $0 \le i, j < v$ with $(z_{i_1}, .., z_{i_{\log v}}) = binary(i)$ and
$(z_{j_1}, .., z_{j_{\log v}}) = binary(j)$,

where we use $binary(n)$ to denote the tuple $z_{n_1}, .., z_{n_{\log v}}$ encoding $n$ in its binary
representation using $z_0, z_1$. For example, assuming $v = 8$, $binary(6) = z_1, z_1, z_0$.
Notice that $|C_1| = v^2(v - 1)$ and $|C_2| = v^2$, the clauses use a single predicate
symbol of arity $2 \log v + 2$, and both clauses use exactly $v + 2$ variables.

Any 1-1 matching between the variables $x_0, .., x_{v-1}$ in $C_1$ and $y_0, .., y_{v-1}$ in
$C_2$ can be represented by a permutation $\pi$ of $\{0, .., v - 1\}$: each variable $x_i$ in $C_1$
is matched to $y_{\pi(i)}$ in $C_2$. Let the matching induced by permutation $\pi$ be

$$\{x_i - y_{\pi(i)} \Rightarrow X_{\pi(i)} \mid 0 \le i \le v - 1\}.$$

First we study $lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2)$, the pairing induced by the 1-1
matching represented by $\pi$ augmented with $z_0$ and $z_1$ matched to themselves.
A literal $p(z_{i_1}, .., z_{i_{\log v}}, z_{j_1}, .., z_{j_{\log v}}, X_a, X_b)$ is included in
$lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2)$
iff $a = \pi(l)$ and $b = \pi(l + 1)$ for some $l \in \{0, .., v - 2\}$ (this is the condition
imposed by $C_1$), and $(z_{i_1}, .., z_{i_{\log v}}) = binary(a)$, $(z_{j_1}, .., z_{j_{\log v}}) = binary(b)$
(this is the condition imposed by $C_2$). Therefore,

$$lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2) = \bigvee_{0 \le l < v-1} p(binary(\pi(l)), binary(\pi(l+1)), X_{\pi(l)}, X_{\pi(l+1)}).$$

Finally, we want to check whether different permutations yield pairings that
are subsumption inequivalent, i.e., if for any $\pi \ne \pi'$

$$lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2) \not\preceq lgg_{\pi' \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2).$$

To this end, we investigate which substitutions $\theta$ satisfy

$$lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2) \cdot \theta \subseteq lgg_{\pi' \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2).$$

If $\theta$ does not change the values of $z_0, z_1$, then as before some atom

$$p(binary(\pi(l)), binary(\pi(l+1)), *, *) \cdot \theta = p(binary(\pi(l)), binary(\pi(l+1)), *, *)$$

in $lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2) \cdot \theta$ does not occur in $lgg_{\pi' \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2)$.
If $\theta$ maps both variables to the same value (either $z_1$ or $z_0$), then inclusion
cannot happen since $lgg_{\pi' \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2)$ contains no atoms of the
form $p(z_0, .., z_0, *, *)$ or $p(z_1, .., z_1, *, *)$. Obviously, if $z_0$ or $z_1$ are mapped into
any other variable $X_i$, then the inclusion is not possible either. Hence, $\theta$ must
exchange the values of $z_0, z_1$, and:

$$p(binary(\pi(l)), binary(\pi(l+1)), *, *) \cdot \theta = p(\overline{binary}(\pi(l)), \overline{binary}(\pi(l+1)), *, *)$$

where $\overline{binary}(n)$ is the "complement" of $binary(n)$. For example, assuming $v = 8$,
$\overline{binary}(6) = z_0, z_0, z_1$. More precisely, $\overline{binary}(n) = binary(v - 1 - n)$. We have
seen that there is only one permutation $\pi' = \overline{\pi}$ for which there exists some $\theta$
s.t. $lgg_{\pi \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2) \cdot \theta \subseteq lgg_{\pi' \cup \{z_0 - z_0, z_1 - z_1\}}(C_1, C_2)$. Moreover, $\theta$ is
exactly $\{z_0 \mapsto z_1, z_1 \mapsto z_0\} \cup \{X_i \mapsto X_{v-1-i} \mid 0 \le i < v\}$. We therefore get:

Theorem 6. Let $S$ be a signature containing a predicate symbol of arity at least
$2 \log v + 2$. The number of distinct pairings between a pair of function free
$S$-clauses using $v + 2$ variables, $O(v^3)$ literals can be $\Omega(v!)$.
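
The role of the complement can be checked with a short sketch. It takes the closed form of the pairing derived above as given, represents $X_i$ by the index $i$, and uses illustrative function names; the complemented permutation is assumed to be $l \mapsto v - 1 - \pi(l)$.

```python
from itertools import permutations

def binary(n, v):
    """Tuple over z0, z1 encoding n with log2(v) positions."""
    bits = v.bit_length() - 1
    return tuple(f"z{b}" for b in format(n, f"0{bits}b"))

def swap(zs):
    """Exchange z0 and z1, i.e. take the complement of an encoding."""
    return tuple({"z0": "z1", "z1": "z0"}[z] for z in zs)

def pairing(pi):
    """Closed form: one literal per pair of consecutive positions of pi."""
    v = len(pi)
    return frozenset((binary(pi[l], v), binary(pi[l + 1], v), pi[l], pi[l + 1])
                     for l in range(v - 1))

def apply_theta(p, v):
    """Apply {z0 -> z1, z1 -> z0} together with X_i -> X_{v-1-i}."""
    return frozenset((swap(bi), swap(bj), v - 1 - a, v - 1 - b) for bi, bj, a, b in p)

v = 4
complement_ok = all(swap(binary(n, 8)) == binary(8 - 1 - n, 8) for n in range(8))
theta_ok = all(apply_theta(pairing(pi), v) == pairing(tuple(v - 1 - x for x in pi))
               for pi in permutations(range(v)))
```

The check confirms that the exchanging substitution sends the pairing of a permutation exactly onto the pairing of its complement.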

As in the previous section we can generalize the result to use predicates with 
fixed arity: 

Theorem 7. Let S be a signature containing a predicate symbol of arity at least 
3. The number of distinct pairings between a pair of function free S -clauses using 
at most V variables and v literals can be f2{2‘"^^). 

5.3 Implications for Learnability 

The Algorithms in [13,3] are shown to learn first order classes from equivalence 
and membership queries. The algorithms use pairings in the process of learn- 
ing and a t^ upper bound on the number of these is used. No explicit lower 
bound was given leaving open the possibility that better analysis might yield 
better upper bounds. The results above can be used to give a concrete example
where an exponential number of queries is indeed used. We sketch the details
here for the algorithm in [3]. Let the target be $T$. The algorithm maintains
a set of meta-clauses as its hypothesis. Two major steps in the algorithm are
minimization and pairing. In minimization, given a counter example clause $C$
s.t. $T \models C$ the algorithm iterates dropping one object at a time and asking an
entailment membership query to check whether it is correct. For example given
$p(x_1, x_2), p(x_2, x_3), p(x_1, x_3), p(x_3, x_4) \rightarrow q(x_3, x_3)$ dropping $x_2$ (and all atoms
using it) yields $p(x_1, x_3), p(x_3, x_4) \rightarrow q(x_3, x_3)$. In this way a counter example
with a minimal set of variables is obtained. Then the algorithm tries to find a
pairing of the minimized example and a meta-clause in the hypothesis which
yields an implied clause of smaller size. This is done by enumerating all "basic"
pairings. If no such pairing is found then the clause is added as a new meta-clause
to the hypothesis. Therefore in order to show that the algorithm makes
an exponential number of queries it suffices to show a target $T = D_1 \wedge D_2$ where
(1) each of $D_1, D_2$ is already minimal so that minimization does not alter them,
(2) $D_1, D_2$ have an exponential number of "basic" pairings, and (3) $T \not\models C$ for
any $C$ which is a pairing of $D_1, D_2$. If this holds then we can give the clause $D_1$
to the algorithm as a counter example and then follow with $D_2$. The algorithm
will ask a membership query on all the pairings getting an answer of No every
time and eventually add $D_2$ to its hypothesis. We omit the technical definition
of "basic" pairings but note that all pairings constructed in the previous section
are "basic" since they map variables to variables.
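
The object-dropping step of minimization can be sketched as follows. This is a simplified illustration and not the algorithm of [3]: the oracle `entails` is a hypothetical stand-in for the entailment membership query, and skipping objects that occur in the head is an assumption made here.

```python
def drop_object(body, head, obj):
    """Remove every body atom that mentions obj; give up if the head mentions it."""
    if obj in head[1]:
        return None  # assumption of this sketch: head objects are never dropped
    return [atom for atom in body if obj not in atom[1]], head

def minimize(body, head, entails):
    """Try to drop objects one at a time, keeping a drop when the oracle says yes."""
    objects = sorted({t for _, args in body for t in args})
    for obj in objects:
        candidate = drop_object(body, head, obj)
        if candidate is not None and entails(*candidate):
            body, head = candidate
    return body, head

body = [("p", ("x1", "x2")), ("p", ("x2", "x3")), ("p", ("x1", "x3")), ("p", ("x3", "x4"))]
head = ("q", ("x3", "x3"))
dropped = drop_object(body, head, "x2")

# Toy oracle, purely illustrative: a clause is "implied" iff its body keeps p(x3, x4).
toy_oracle = lambda b, h: ("p", ("x3", "x4")) in b
minimal = minimize(body, head, toy_oracle)
```

Dropping `x2` reproduces the example in the text, leaving the body $p(x_1, x_3), p(x_3, x_4)$ with the head unchanged.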

Let $f()$ be a nullary predicate symbol, and $q()$ and $r()$ binary predicates.
Let $N_1, N_2$ be the number of variables used in $C_1, C_2$ in the construction above
respectively, and rename these variables (in any order) so that $C_1$ uses variables
$v_1, \ldots, v_{N_1}$, and $C_2$ uses variables $w_1, \ldots, w_{N_2}$. Then we use $q()$ and $r()$ to define
chains of variables touching all variables in $C_1, C_2$: $Q = \bigwedge_{1 \le i < N_1} q(v_i, v_{i+1})$ and
$R = \bigwedge_{1 \le i < N_2} r(w_i, w_{i+1})$. Now define $C'_1, C'_2$ to be the conjunction of the atoms
from $C_1, C_2$ above (we used disjunction above) and let $D_1 = C'_1 \wedge Q \rightarrow f()$ and
$D_2 = C'_2 \wedge R \rightarrow f()$. Finally $T = D_1 \wedge D_2$. Since a linear chain formed as in $Q, R$
never subsumes any subchains and since $Q, R$ use different predicate symbols we
can show the following:

Lemma 10. Let $T = D_1 \wedge D_2$ as defined above and let $C$ be any clause.

(1) $D_1 \not\preceq D_2$ and hence $D_1 \not\models D_2$; $D_2 \not\preceq D_1$ and hence $D_2 \not\models D_1$.

(2) If $T \models C$ then it is the case that either $D_1 \preceq C$ or $D_2 \preceq C$.

(3) If $D$ is a result of dropping any object from $D_1$ or $D_2$ then $T \not\models D$.



Theorem 8 . The algorithm of [3] can make 17 ( 2 "/'*) queries on some targets. 
Abstract. We consider the polynomial time learnability of ordered tree
patterns with internal structured variables, in the query learning model 
of Angluin (1988). An ordered tree pattern with internal structured vari- 
ables, called a term tree, is a representation of a tree structured pattern 
in semistructured or tree structured data such as HTML/XML files. 
Standard variables in term trees can be substituted by an arbitrary tree 
of arbitrary height. In this paper, we introduce a new type of variables, 
which are called height-bounded variables. An i-height-bounded variable 
can be replaced with any tree of height at most i. By this type of vari- 
ables, we can define tree structured patterns with rich structural features. 
We assume that there are at least two edge labels. We give a polynomial 
time algorithm for term trees with height-bounded variables using mem- 
bership queries and one positive example. We also give hardness results 
which indicate that one positive example is necessary to learn term trees 
with height-bounded variables. 



1 Introduction 

Semistructured data such as HTML/XML files have no rigid structure but have 
tree structures. Such semistructured data are called tree structured data, in 
general. Tree structured data are represented by rooted trees t such that all 
children of each internal vertex of t are ordered and t has edge labels [1]. In the 
fields of data mining and knowledge discovery, many researchers have developed 
techniques based on machine learning for analyzing tree structured data, and 
some types of tree structured patterns have been proposed. For example, an 
ordered term tree [6,7,8,10], simply called a term tree, can directly represent not 
only associations but also structural relations between substructures common 
to tree structured data, because a term tree is a tree structured pattern with 
ordered children and structured variables which can be substituted by arbitrary 
trees. A variable in a term tree $t$ has a variable label and is represented by a
list $[u_0, u_1, \ldots, u_\ell]$ ($\ell \ge 1$) where $u_0$ is an internal vertex of $t$ and $u_1, \ldots, u_\ell$ are
consecutive children of $u_0$. A variable can be replaced with an arbitrary tree $T$
so that the root of $T$ is identified with $u_0$ and $\ell$ leaves of $T$ are identified with
$u_1, \ldots, u_\ell$. We say that a variable $[u_0, u_1, \ldots, u_\ell]$ is a multi-child-port variable if
$\ell \ge 2$, a single-child-port variable if $\ell = 1$. For example, in Fig. 1, we give the
term tree $s$ such that $T_1$, $T_2$ and $T_3$ shown in Fig. 1 are obtained from $s$ by
replacing variables labeled with "x", "y" and "z" with appropriate trees.

In [11], we introduce a new kind of variables, called $(i,j)$-height-constrained
variables ($1 \le i \le j$), to a term tree, in order to present a tree structured
pattern which can also have distance relations between substructures common
to tree structured data. An $(i,j)$-height-constrained variable $[u_0, u_1, \ldots, u_\ell]$ can
be replaced with a tree $T$ if the following two conditions hold: (1) the minimum
depth of $\ell$ leaves of $T$ which are identified with $u_1, \ldots, u_\ell$ is at least $i$, and
(2) the height of $T$ is at most $j$. For example, in Fig. 1, the tree $T_3$ can be
obtained from $t$ by replacing the (1,1)-height-constrained variable having label
"x(1,1)", the (2,3)-height-constrained variable having label "y(2,3)" and the
(1,2)-height-constrained variable "z(1,2)" of the term tree $t$ with the trees $g_1$,
$g_2$ and $g_3$, respectively. However, neither of the trees $T_1$ and $T_2$ given in Fig. 1 can
be obtained from $t$. By using term trees having height-constrained variables as
tree structured patterns and setting appropriate values for height-constrained
variables, we can design data mining tools which can extract rare interesting
substructures from tree structured data. For any $j \ge 1$, we call a $(1,j)$-height-constrained
variable a $j$-height-bounded variable. In this paper, we consider the
learnabilities of some classes of term trees having height-bounded variables.

We say that a term tree t is linear (or regular) if all variable labels in t are 
mutually distinct. For a set of edge labels A, the term tree language of a term 
tree t, denoted by LA{t), is the set of all edge-labeled trees which are obtained 
from t by replacing substitutable trees for all variables in t. For a linear term 
tree t, we say that L/i(t) is a linear term tree language of t. Let OTT^ be the set 
of all linear term trees each of whose variables is either a height-bounded multi- 
child-port variable or a height-bounded single-child-port variable. We denote by 
OTT^’^ the set of all linear term trees in OTT^ each of whose variables is a 
height-bounded single-child-port variable. In this paper, we show that the classes 
CJTT^’^ and OVT^ are learnable in polynomial time in the exact learning model 
of Angluin [3]. In this model, a learning algorithm is said to exactly identify a 
target r» if the algorithm outputs a term tree r such that LA{r) = LA{rA) and 
halts, after it uses some queries and additional information. In order to show 
the exact learnabilities of the classes OTT^’^ and OTT^, we give algorithms 
which exactly identify any term tree in OVT a ^ OTT^ in polynomial time 
by using membership queries and one positive example. 

As our previous works, in [6], we showed that any finite set of linear term 
trees having single-child-port variables only is exactly identifiable in polynomial 
time using queries. Moreover, we showed that any finite set of nonlinear term 
trees having single-child-port variables only is exactly identifiable in polynomial 
time using queries [8]. In [10,12], we showed the class of linear term tree lan- 
guages is polynomial time inductively inferable from positive data for any A 
with jAj > 1. Further, we proposed a tag tree pattern, which is an extension 
of a term tree, as a tree structured pattern in semistructured data and gave a 
Fig. 1. Trees $T_1, T_2, T_3$, term trees $s$ and $t$ having height-constrained variables. A vertex
and an edge are denoted by a circle and a line, respectively. A variable is denoted by
a square connecting to all vertices in the list which represents the variable. Notations
$x$ and $x(i,j)$ in squares show that the square is a variable with a variable label $x$ and
an $(i,j)$-height-constrained variable with a variable label $x$, respectively.



data mining method from semistructured data, based on a learning algorithm for 
term trees [9]. In [11], we showed that a subclass of OTT^’^ is polynomial time 
inductively inferable from positive data for any yl with |7l| > 2. As other related 
works, the works [2,4] show the exact learnability of tree structured pattern in 
the exact learning model. 

This paper is organized as follows. In Section 2 and 3, we explain term trees 
and our learning model. In Section 4, we show that the classes CJTT^’^ and 
OTT^ are exactly learnable in polynomial time using membership queries and 
one positive example. In Section 5, we give hardness results which indicate that 
one positive example is necessary to identify a target term tree. 
2 Preliminaries

Let $T = (V_T, E_T)$ be an ordered tree with a vertex set $V_T$ and an edge set $E_T$.
A list $h = [u_0, u_1, \ldots, u_\ell]$ of vertices in $V_T$ is called a variable of $T$ if $u_1, \ldots, u_\ell$
are consecutive children of $u_0$, i.e., $u_0$ is the parent of $u_1, \ldots, u_\ell$ and $u_{j+1}$ is the
next sibling of $u_j$ for any $j$ with $1 \le j < \ell$. We call $u_0$ the parent port of the
variable $h$ and $u_1, \ldots, u_\ell$ the child ports of $h$. Two variables $h = [u_0, u_1, \ldots, u_\ell]$
and $h' = [u'_0, u'_1, \ldots, u'_{\ell'}]$ are said to be disjoint if $\{u_1, \ldots, u_\ell\} \cap \{u'_1, \ldots, u'_{\ell'}\} = \emptyset$.
For a set or a list $S$, we denote by $|S|$ the number of elements in $S$.
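
These two conditions are easy to test mechanically. The sketch below assumes an ordered tree is given as a dictionary mapping each vertex to the ordered list of its children; the function names are ours.

```python
def is_variable(children, h):
    """h = [u0, u1, .., ul] is a variable if u1..ul are consecutive children of u0."""
    u0, ports = h[0], list(h[1:])
    kids = children.get(u0, [])
    if not ports or ports[0] not in kids:
        return False
    start = kids.index(ports[0])
    return kids[start:start + len(ports)] == ports

def disjoint(h1, h2):
    """Two variables are disjoint if they share no child port."""
    return not (set(h1[1:]) & set(h2[1:]))

# Ordered tree: vertex 0 has children 1, 2, 3 (in this order); vertex 1 has child 4.
children = {0: [1, 2, 3], 1: [4]}
```

Note that two disjoint variables may still share their parent port, as `[0, 1, 2]` and `[0, 3]` do here.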

Definition 1 (Ordered term trees). Let $T = (V_T, E_T)$ be an ordered tree
and $H_T$ a set of pairwise disjoint variables of $T$. An ordered term tree obtained
from $T$ and $H_T$ is a triplet $t = (V_t, E_t, H_t)$, where $V_t = V_T$,
$E_t = E_T \setminus \bigcup_{h = [u_0, u_1, \ldots, u_\ell] \in H_T} \{\{u_0, u_i\} \in E_T \mid 1 \le i \le \ell\}$ and $H_t = H_T$. For two
vertices $u, u' \in V_t$, we say that $u$ is the parent of $u'$ in $t$ if $u$ is the parent of $u'$
in $T$. Similarly we say that $u'$ is a child of $u$ in $t$ if $u'$ is a child of $u$ in $T$. In
particular, for a vertex $u \in V_t$ with no child, we call $u$ a leaf of $t$.

We define the order of the children of each vertex $u$ in $t$ as the order of the
children of $u$ in $T$. We often omit the description of the ordered tree $T$ and
variable set $H_T$ because we can find them from the triplet $t = (V_t, E_t, H_t)$. We
denote by $|t|$ the number of vertices in $t$.

For any ordered term tree $t$, a vertex $u$ of $t$, and two children $u'$ and $u''$ of $u$,
we write $u' <^t_u u''$ if $u'$ is smaller than $u''$ in the order of the children of $u$. We
assume that every edge and variable of an ordered term tree is labeled with some
words from specified languages. A label of a variable is called a variable label. $\Lambda$
and $X$ denote a set of edge labels and a set of variable labels, respectively, where
$\Lambda \cap X = \emptyset$. An ordered term tree $t = (V_t, E_t, H_t)$ is called linear (or regular) if
all variables in $H_t$ have mutually distinct variable labels in $X$.

Note. In this paper, we treat only linear ordered term trees, and then we call a 
linear ordered term tree a term tree, simply. In particular, an ordered term tree 
with no variable is called a ground term tree or tree and considered to be a tree 
with ordered children. 

For a term tree $t$ and its vertices $v_1$ and $v_i$, a path from $v_1$ to $v_i$ is a sequence
$v_1, v_2, \ldots, v_i$ of distinct vertices of $t$ such that for any $j$ with $1 \le j < i$, $v_j$ is the
parent of $v_{j+1}$. The height of a term tree $t$ is the length of the longest path from
the root to a leaf, and the height of a vertex $u$ of $t$ is the length of the longest
path from $u$ to a leaf.

In this paper, we deal only with height-bounded variables, which are a
restricted form of height-constrained variables. Therefore we give here only the
definition of height-bounded variables. The general definition of
height-constrained variables is given in [11].

Definition 2 (Height-bounded variables). Let $X^{h}$ be an infinite subset of
a variable label set $X$. For an integer $i \ge 1$, let $X^{h(i)}$ be an infinite subset of
$X^{h}$. We assume that $X^{h} = \bigcup_{i \ge 1} X^{h(i)}$ and $X^{h(i)} \cap X^{h(i')} = \emptyset$ for $i \ne i'$. A
variable label in $X^{h}$ is called a height-bounded variable label. In particular, a
variable label in $X^{h(i)}$ is called an $i$-height-bounded variable label. A variable
with an $i$-height-bounded variable label is called an $i$-height-bounded variable.
For an $i$-height-bounded variable $e$, we say that
the value $i$ is the size of the variable $e$, and denote it by $size(e)$.

Let $f = (V_f, E_f, H_f)$ and $g = (V_g, E_g, H_g)$ be term trees. We say that $f$ and
$g$ are isomorphic, denoted by $f \equiv g$, if there is a bijection $\varphi$ from $V_f$ to $V_g$ such
that (i) the root of $f$ is mapped to the root of $g$ by $\varphi$, (ii) $\{u, v\} \in E_f$ if and only
if $\{\varphi(u), \varphi(v)\} \in E_g$, and $\{u, v\}$ and $\{\varphi(u), \varphi(v)\}$ have the same edge label, (iii) for
$i \ge 1$, $[u_0, u_1, \ldots, u_\ell]$ is an $i$-height-bounded variable in $H_f$ if and only if
$[\varphi(u_0), \varphi(u_1), \ldots, \varphi(u_\ell)]$ is an $i$-height-bounded variable in $H_g$, and
(iv) for any internal vertex $u$ in $f$ which has more than one child, and for any
two children $u'$ and $u''$ of $u$, $u' <^f_u u''$ if and only if $\varphi(u') <^g_{\varphi(u)} \varphi(u'')$.

Definition 3 (Substitutions). Let $f$ be a term tree with at least $\ell + 1$ vertices,
$x$ a variable label in $X^{h(i)}$ for $i \ge 1$, and $g$ a tree with at least $\ell + 1$ vertices.
Let $h = [v_0, v_1, \ldots, v_\ell]$ be a variable in $f$ with the variable label $x$ and $\sigma =
[u_0, u_1, \ldots, u_\ell]$ a list of $\ell + 1$ distinct vertices in $g$, where $u_0$ is the root of $g$ and
$u_1, \ldots, u_\ell$ are leaves of $g$. The form $x := [g, \sigma]$ is called a binding for $x$ if the
height of $g$ is at most $i$.

A new term tree $f' = f\{x := [g, \sigma]\}$ is obtained by applying the binding
$x := [g, \sigma]$ to $f$ in the following way. Let $e = [v_0, v_1, \ldots, v_\ell]$ be an $i$-height-bounded
variable in $f$ with the variable label $x$. Let $g'$ be one copy of $g$ and
$w_0, w_1, \ldots, w_\ell$ the vertices of $g'$ corresponding to $u_0, u_1, \ldots, u_\ell$ of $g$, respectively.
For the variable $e = [v_0, v_1, \ldots, v_\ell]$, we attach $g'$ to $f$ by removing the variable
$e$ from $H_f$ and by identifying the vertices $v_0, v_1, \ldots, v_\ell$ with the vertices
$w_0, w_1, \ldots, w_\ell$ of $g'$, respectively. We define a new ordering $<^{f'}_v$ on every vertex
$v$ in $f'$ in the following natural way. Suppose that $v$ has more than one child and
let $v'$ and $v''$ be two children of $v$ in $f'$. We note that $v_i = w_i$ for any $0 \le i \le \ell$.
(1) If $v, v', v'' \in V_{g'}$ and $v' <^{g'}_v v''$, then $v' <^{f'}_v v''$. (2) If $v, v', v'' \in V_f$ and
$v' <^{f}_v v''$, then $v' <^{f'}_v v''$. (3) If $v = v_0 (= w_0)$, $v' \in V_f - \{v_1, \ldots, v_\ell\}$, $v'' \in V_{g'}$,
and $v' <^{f}_v v_1$, then $v' <^{f'}_v v''$. (4) If $v = v_0 (= w_0)$, $v' \in V_f - \{v_1, \ldots, v_\ell\}$, $v'' \in V_{g'}$,
and $v_\ell <^{f}_v v'$, then $v'' <^{f'}_v v'$. A substitution $\theta$ is a finite collection of bindings
$\{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$, where the $x_i$'s are mutually distinct variable
labels in $X$ and the $g_i$'s are trees. The term tree $f\theta$, called the instance of $f$ by $\theta$,
is obtained by applying all the bindings $x_i := [g_i, \sigma_i]$ to $f$ simultaneously. The
root of the resulting term tree $f\theta$ is the root of $f$.

For example, let $t'$ be a term tree and $\theta = \{x_1 := [T_5, [u_1, w_1]], x_2 :=
[T_6, [u_2, w_2]], x_3 := [T_7, [u_3, w_3, w_4]]\}$ a substitution, where $T_5$, $T_6$ and $T_7$ are
trees in Fig. 2. Then, the instance $t'\theta$ of the term tree $t'$ by $\theta$ is the tree $T_4$.

We write $f \preceq g$ if there exists a substitution $\theta$ with $f \equiv g\theta$. If $f \preceq g$ and
$f \not\equiv g$, then we write $f \prec g$. We define the size of the term tree $t$ as the sum of
the number of vertices and the sizes of the variables in $t$, and denote it by $size(t)$.
More precisely, for a term tree $t = (V_t, E_t, H_t)$,

$size(t) = |V_t| + \sum_{i \ge 1} i \times |\{e \in H_t \mid e \text{ is an } i\text{-height-bounded variable}\}|$.

Fig. 2. A square with a notation $x(i)$ shows that the square is an $i$-height-bounded
variable with a variable label $x$.



By the definition, it is clear that $|t| \le size(t)$ for any term tree $t$ in which all
variables have height-bounded variable labels.
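The size measure above can be sketched in a few lines of Python. This is only an illustration: the `TermTree` container and its field names are hypothetical and not part of the paper, and each variable is stored as a pair of its port list and its height bound $i$.

```python
from dataclasses import dataclass

@dataclass
class TermTree:
    """Hypothetical container for a term tree t = (V_t, E_t, H_t)."""
    vertices: set      # V_t
    edges: set         # E_t, stored as frozenset({u, v}) pairs
    variables: list    # H_t, stored as (ports, bound) with ports = [parent, child_1, ...]

def num_vertices(t):
    # |t| is the number of vertices of t
    return len(t.vertices)

def size(t):
    # size(t) = |V_t| + sum over variables of their height bound i,
    # which equals the sum over i of i * (number of i-height-bounded variables)
    return len(t.vertices) + sum(bound for _ports, bound in t.variables)

# A toy term tree with one edge, one 3-height-bounded and one 1-height-bounded variable.
example = TermTree(
    vertices={0, 1, 2, 3},
    edges={frozenset({0, 1})},
    variables=[([1, 2], 3), ([0, 3], 1)],
)
```

For `example`, `size` gives 4 + 3 + 1 = 8, while the number of vertices is 4, in line with the inequality $|t| \le size(t)$.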

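We denote by $\mathcal{OT}_\Lambda$ the set of all trees. We define $\mathcal{OTT}^{h}_\Lambda$ as the set of all term
trees in which all variables have height-bounded variable labels. The term tree
language $L_\Lambda(t)$ of a term tree $t \in \mathcal{OTT}^{h}_\Lambda$ is $\{s \in \mathcal{OT}_\Lambda \mid s \preceq t\}$. In particular,
we denote by $\mathcal{OTT}^{1,h}_\Lambda$ the set of term trees in $\mathcal{OTT}^{h}_\Lambda$ all of whose variables have
one child port. When it is clear from the context, the notation $\mathcal{OTT}^{h}_\Lambda$ (resp.
$\mathcal{OTT}^{1,h}_\Lambda$) is abused to stand for the class of languages $\{L_\Lambda(r) \mid r \in \mathcal{OTT}^{h}_\Lambda\}$
(resp. $\{L_\Lambda(r) \mid r \in \mathcal{OTT}^{1,h}_\Lambda\}$).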

3 Learning Model 

Let $\Lambda$ be a set of edge labels and $\mathcal{TT}_\Lambda$ a class of term trees. In this paper, let
$r_*$ be a term tree in $\mathcal{TT}_\Lambda$ to be identified; we call the term tree $r_*$
the target. A term tree $r$ is called a positive example of $L_\Lambda(r_*)$ if $r$ is in $L_\Lambda(r_*)$.
We use the exact learning model via queries due to Angluin [3]. In this
model, learning algorithms have access to oracles that answer specific kinds of
queries about the unknown term tree language $L_\Lambda(r_*)$. We consider the following
oracles.

1. Membership oracle $Mem_{r_*}$: The input is a tree $r$. The output is "yes" if $r$ is
in $L_\Lambda(r_*)$, and "no" otherwise. The query is called a membership query.

2. Subset oracle $Sub_{r_*}$: The input is a term tree $r$ in $\mathcal{TT}_\Lambda$. The output is "yes" if
$L_\Lambda(r) \subseteq L_\Lambda(r_*)$. Otherwise, it returns a counterexample $t \in L_\Lambda(r) - L_\Lambda(r_*)$.
The query is called a subset query.

3. Equivalence oracle $Equiv_{r_*}$: The input is a term tree $r$ in $\mathcal{TT}_\Lambda$. If $L_\Lambda(r) =
L_\Lambda(r_*)$, then the output is "yes". Otherwise, it returns a counterexample $t \in
(L_\Lambda(r) \cup L_\Lambda(r_*)) - (L_\Lambda(r) \cap L_\Lambda(r_*))$. The query is called an equivalence query.
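The interface of the three oracles can be illustrated with a toy Python model. This is a sketch under a strong simplifying assumption: languages are represented as finite Python sets of strings, whereas term tree languages are in general infinite and queries take term trees rather than explicit sets. The class name `Oracles` and its method names are hypothetical.

```python
class Oracles:
    """Toy oracles for a target language given as a finite set (an assumption)."""

    def __init__(self, target):
        self.target = frozenset(target)
        self.queries = 0  # number of queries asked so far

    def membership(self, tree):
        # "yes" iff the tree belongs to the target language
        self.queries += 1
        return tree in self.target

    def subset(self, hypothesis):
        # "yes" iff L(r) is a subset of L(r_*); otherwise return some t in L(r) - L(r_*)
        self.queries += 1
        extra = set(hypothesis) - self.target
        if not extra:
            return True, None
        return False, min(extra)

    def equivalence(self, hypothesis):
        # "yes" iff L(r) = L(r_*); otherwise return some t in the symmetric difference
        self.queries += 1
        diff = set(hypothesis) ^ self.target
        if not diff:
            return True, None
        return False, min(diff)

oracle = Oracles({"a", "b"})
```

Choosing the counterexample with `min` is only a device to make the toy deterministic; in the model itself the oracle may return any element of the difference.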

A learning algorithm $\mathcal{A}$ may collect information about $L_\Lambda(r_*)$ by using queries, and
it outputs a term tree in $\mathcal{TT}_\Lambda$. We say that a learning algorithm exactly identifies
a target $r_*$ if it outputs a term tree $r$ in $\mathcal{TT}_\Lambda$ with $L_\Lambda(r) = L_\Lambda(r_*)$ and halts
after it uses some queries and additional information. We consider the above
three oracles for $\mathcal{OTT}^{1,h}_\Lambda$ and $\mathcal{OTT}^{h}_\Lambda$.

Fig. 3. The term tree $t_2$ is obtained from $t_1$ by the contraction of an edge.
Moreover, $t_3$ is obtained from $t_2$ by the contraction of an edge.



4 Learning Using Membership Queries 



In this section, we show that any ordered term tree in $\mathcal{OTT}^{h}_\Lambda$ is exactly
identifiable using membership queries and one positive example. Throughout this
section, we assume that $|\Lambda| \ge 2$.

Let $t = (V_t, E_t, H_t)$ be a tree and $e = \{u, v\}$ an edge in $E_t$, where $u$ is the parent of $v$. We define the
contraction of $e$ to $t$ as the following operation: If $v$ has children $v_1, \ldots, v_\ell$, then
the operation removes $v$ from $V_t$ and replaces $\{u, v\}, \{v, v_1\}, \ldots, \{v, v_\ell\}$ in $E_t$
with new edges $\{u, v_1\}, \ldots, \{u, v_\ell\}$, that is, $E_t = E_t \cup \{\{u, v_1\}, \ldots, \{u, v_\ell\}\} -
\{\{u, v\}, \{v, v_1\}, \ldots, \{v, v_\ell\}\}$. Otherwise, the operation removes $v$ from $V_t$ and $e$
from $E_t$. We denote by $contraction(t, e)$ the tree obtained from $t$ by applying
the contraction of $e$ to $t$.
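The contraction operation can be sketched in Python as follows. The representation is an assumption made for illustration: an ordered tree is a dictionary mapping every vertex to the ordered list of its children, edge labels are ignored, and the children of the removed vertex are assumed to take its position in the child order of its parent.

```python
def contraction(children, u, v):
    """Return a new child map in which the edge {u, v} is contracted.

    children maps every vertex to the ordered list of its children
    (leaves map to []). u must be the parent of v. The input is not modified.
    """
    if v not in children.get(u, []):
        raise ValueError("{u, v} is not a parent-child edge")
    # Copy every entry except the removed vertex v.
    new = {x: list(cs) for x, cs in children.items() if x != v}
    # Splice the children of v into the position that v occupied under u.
    i = new[u].index(v)
    new[u][i:i + 1] = children.get(v, [])
    return new

tree = {0: [1, 4], 1: [2, 3], 2: [], 3: [], 4: []}
```

Contracting the edge between 0 and 1 in `tree` makes 2 and 3 children of 0 ahead of 4, while contracting the edge to the leaf 4 simply removes that leaf.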

We use contractions to decrease the number of vertices in $t$. For example,
in Fig. 3, $t_2$ is the tree obtained by the contraction of the edge $\{u_0, u_2\}$ from $t_1$,
and $t_3$ is obtained by the contraction of a further edge from $t_2$.

Lemma 1. Let $t$ be a term tree in $\mathcal{OTT}^{h}_\Lambda$ and $t' = (V_{t'}, E_{t'}, H_{t'})$ a tree such
that $t' \in L_\Lambda(t)$. If $contraction(t', e) \notin L_\Lambda(t)$ for any $e \in E_{t'}$, then $|t| = |t'|$.

Let $t = (V_t, E_t, H_t)$ be a term tree in $\mathcal{OTT}^{1,h}_\Lambda$ and $e = [u, v]$ a variable in
$H_t$. We define the extension of $e$ to $t$ as the following operation: The operation
replaces the variable $e$ with variables $[u, w], [w, v]$, where $w$ is a new vertex, that is, $V_t = V_t \cup \{w\}$ and
$H_t = H_t \cup \{[u, w], [w, v]\} - \{[u, v]\}$. The variables are new 1-height-bounded
variables. We denote by $extension(t, e)$ the term tree obtained from $t$ by applying
the extension of $e$ to $t$.

Let $t = (V_t, E_t, H_t)$ be a term tree. We call $t$ a chain term tree if $V_t =
\{v_0, v_1, \ldots, v_k\}$, $E_t = \emptyset$, $H_t = \{[v_0, v_1], \ldots, [v_{k-1}, v_k]\}$ $(k \ge 1)$, and $v_0$ is the root
of $t$. Moreover, we call $t$ a chain tree or $k$-chain tree if $V_t = \{v_0, v_1, \ldots, v_k\}$,
$E_t = \{\{v_0, v_1\}, \ldots, \{v_{k-1}, v_k\}\}$ $(k \ge 1)$, $H_t = \emptyset$, and $v_0$ is the root of $t$. For a
$k$-chain tree $t = (V_t, E_t, H_t)$ and each vertex $v_i \in V_t$ $(0 \le i \le k - 1)$, we attach a
$(k - i)$-chain tree to both sides of $v_i$ respectively, and call such a tree a $k$-triangle
tree. For example, in Fig. 4, $t_4$ is a 3-chain tree and $t_5$ is a 3-triangle tree.

Let $t = (V_t, E_t, H_t)$ be a term tree in $\mathcal{OTT}^{1,h}_\Lambda$. Let $S = [u_\ell, u_{\ell-1}, \ldots, u_0]$
$(\ell \ge 1)$ be a sequence of vertices in $t$ such that $[u_i, u_{i+1}] \in H_t$ $(i = 0, \ldots, \ell - 1)$
and each vertex $u_i$ $(1 \le i \le \ell - 1)$ has only one child. We denote by $triangle(t, S)$
a tree obtained from $t$ by replacing variables which consist of vertices in $S$
with an $\ell$-triangle tree, and other variables with edges. Moreover, we denote
by $replace(t, S)$ (resp. $tree(t)$) a term tree (resp. a tree) obtained from $t$ by
replacing variables which consist of vertices in $S$ (resp. each variable in $t$)
with an $|S|$-height-bounded variable (resp. an edge). For example, in Fig. 5,
for $S = [u_7, u_6, u_5, u_3]$, we have $t_8 = triangle(t_7, S)$, $t_9 = replace(t_7, S)$ and
$t_{10} = tree(t_9)$.

Fig. 4. $t_4$ is a 3-chain tree, $t_5$ is a 3-triangle tree and $t_6$ is a (3,5)-broom tree.

Fig. 5. For $S = [u_7, u_6, u_5, u_3]$, $t_8 = triangle(t_7, S)$, $t_9 = replace(t_7, S)$ and
$t_{10} = tree(t_9)$.

Theorem 1. The algorithm LEARN_OTT1 of Fig. 8 exactly identifies any term
tree $r_* \in \mathcal{OTT}^{1,h}_\Lambda$ in polynomial time using at most $O(n^2)$ membership queries
and one positive example $t$, where $|\Lambda| \ge 2$ and $n = \max\{|t|, size(r_*)\}$.

Proof. (Sketch) By Lemma 1, we can get a tree $t$ such that $|t| = |r_*|$ by using
the procedure Edge_Contraction in Fig. 6. With the procedure Variable_Extension
of Fig. 6, we can get a term tree which is isomorphic to a term tree obtained
from $r_*$ by replacing each variable with the longest substitutable chain term tree,
which consists of 1-height-bounded variables only.

Let $e_1^*, e_2^*, \ldots, e_i^*, \ldots$ be the sequence of variables in $r_*$ in post-order traversal.
For the term tree $r$ output by the algorithm LEARN_OTT1, let $e_1, e_2, \ldots,
e_i, \ldots$ be the sequence of variables in $r$ in post-order traversal. We assume that

Procedure Edge_Contraction(t);
Given: The oracle $Mem_{r_*}$ for a target $r_* \in \mathcal{OTT}^{1,h}_\Lambda$ and one positive example $t$;
begin
    repeat
        foreach edge $e$ in $t$ do begin
            Let $t' := contraction(t, e)$;
            if $Mem_{r_*}(t')$ = yes then begin $t := t'$; break; end;
        end;
    until $t$ does not change;
    return $t$;
end.



Procedure Variable_Extension(t);
Given: The oracle $Mem_{r_*}$ for a target $r_* \in \mathcal{OTT}^{1,h}_\Lambda$
       and a tree $t \in L_\Lambda(r_*)$ with $|t| = |r_*|$;
begin
    $s := t$;
    foreach edge $e_s$ in $s$ do begin
        Let $s'$ be a term tree obtained from $s$ by replacing the edge label
            of $e_s$ with another edge label;
        if $Mem_{r_*}(s')$ = yes then begin
            Replace the edge $e_t$ in $t$ corresponding to $e_s$
                with a new 1-height-bounded variable;
        end;
    end;
    repeat
        foreach variable $e$ in $t$ do begin
            $t' := extension(t, e)$;
            $t'' := tree(t')$;
            if $Mem_{r_*}(t'')$ = yes then begin $t := t'$; break; end;
        end;
    until $t$ does not change;
    return $t$;
end.



Fig. 6. Procedure Edge-Contraction and Procedure Variable-Extension 
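The repeat-loop of Procedure Edge_Contraction can be sketched in Python as below. The tree representation (a dictionary from each vertex to its ordered list of children), the helper names, and in particular the stand-in membership oracle are assumptions for illustration only; the real oracle answers membership in $L_\Lambda(r_*)$.

```python
def contraction(children, u, v):
    # Contract the edge {u, v}: the children of v take the place of v under u.
    new = {x: list(cs) for x, cs in children.items() if x != v}
    i = new[u].index(v)
    new[u][i:i + 1] = children.get(v, [])
    return new

def edges_of(children):
    # All parent-child edges of the tree, as (parent, child) pairs.
    return [(u, v) for u, cs in children.items() for v in cs]

def edge_contraction(t, member):
    """Greedily contract edges while the membership oracle keeps answering yes."""
    changed = True
    while changed:                      # repeat ... until t does not change
        changed = False
        for (u, v) in edges_of(t):      # foreach edge e in t
            candidate = contraction(t, u, v)
            if member(candidate):       # membership query on the contracted tree
                t = candidate
                changed = True
                break
    return t

# Stand-in oracle (hypothetical): the toy target accepts every tree with >= 3 vertices.
asked = []
def toy_member(candidate):
    asked.append(candidate)
    return len(candidate) >= 3

start = {0: [1, 4], 1: [2, 3], 2: [], 3: [], 4: []}
result = edge_contraction(start, toy_member)
```

Each pass of the outer loop removes at most one vertex and asks at most one query per edge, which is where a quadratic bound on the number of membership queries in $|t|$ comes from.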



there exists a positive integer $i$ such that $size(e_i^*) \ne size(e_i)$. Let $k$ be the minimum
integer satisfying the above condition, $\ell_*$ the size of $e_k^*$ and $\ell$ the size of $e_k$.
We assume $\ell < \ell_*$. Let $r'$ be a term tree obtained from $r$ by replacing each variable
$e_j$ with a chain term tree which consists of $\alpha$ 1-height-bounded variables for
any $j \ge k$, and let $[w_0, w_1], \ldots, [w_{\ell-1}, w_\ell]$ be the variables in $r'$ with which $e_k$ is replaced,
where $\alpha = size(e_j)$. Let $u_0$ be the parent of $w_0$ and $S = [w_\ell, w_{\ell-1}, \ldots, w_0, u_0]$.
In the repeat-loop of the procedure Variable_Replacing in Fig. 7, for the tree
$triangle(r', S)$, the membership oracle outputs "no". But $triangle(r', S)$ is in
$L_\Lambda(r_*)$.

Procedure Variable_Replacing(t);
Given: The oracle $Mem_{r_*}$ for a target $r_* \in \mathcal{OTT}^{1,h}_\Lambda$ and a term tree $t$ obtained from $r_*$
       by replacing each variable with the longest substitutable chain term tree which
       consists of 1-height-bounded variables;

begin
    Let $V = [v_1, v_2, \ldots, v_\ell]$ be the sequence of vertices in $t$ in post-order traversal;
    $r := t$;
    while $V \ne [v_\ell]$ do begin
        Let $u$ be the first element in $V$;
        $S := [u]$;
        repeat
            Let $v$ be the parent of $u$;
            if $[v, u] \in H_t$ then begin $S := S + [v]$; end
            else begin break; end;
            $t' := triangle(t, S)$;
            if $Mem_{r_*}(t')$ = yes then begin $u := v$; end
            else begin $S := S - [v]$; break; end;
        until $u$ has more than one child;
        if $u$ has more than one child and $|S| \ge 2$ then begin
            Let $R$ be the sequence of vertices in $r$ corresponding to $S$;
            Let $u_r$ be the vertex in $r$ corresponding to $u$;
            $r' := tree(replace(r, R))$;
            if $Mem_{r_*}(r')$ = yes then begin
                $r := replace(r, R)$;
            end else begin
                $R := R - [u_r]$; $r := replace(r, R)$; $S := S - [u]$;
            end;
            Let $w$ be the vertex which was added to $S$ last;
            Remove the vertices in $S - [w]$ from $V$;
        end else if $|S| \ge 2$ then begin
            Let $R$ be the sequence of vertices in $r$ corresponding to $S$;
            $r := replace(r, R)$;
            Let $w$ be the vertex which was added to $S$ last;
            Remove the vertices in $S - [w]$ from $V$;
        end else if $|S| = 1$ then begin
            Remove the vertex $u$ from $V$;
        end;
    end;
    return $r$;
end.



Fig. 7. Procedure Variable-Replacing 



This is a contradiction. For $\ell > \ell_*$, we can show a contradiction in a similar way. Hence, for
any $i \ge 1$, $size(e_i^*) = size(e_i)$. Thus, the algorithm outputs a term tree $r$ with
$r_* \equiv r$.

Algorithm LEARN_OTT1
Given: The oracle $Mem_{r_*}$ for a target $r_* \in \mathcal{OTT}^{1,h}_\Lambda$ and one positive example $t$;
Output: A term tree $r \in \mathcal{OTT}^{1,h}_\Lambda$ with $r \equiv r_*$;
begin
    $t$ := Edge_Contraction($t$);
    $t$ := Variable_Extension($t$);
    $r$ := Variable_Replacing($t$);
    output $r$;
end.



Fig. 8. Algorithm LEARN_OTT1



Let $t$ be a given positive example. The procedure Edge_Contraction uses at
most $O(|t|^2)$ membership queries. In the procedure Variable_Extension, the first
foreach-loop uses at most $O(|r_*|)$ membership queries, and the repeat-loop uses
at most $O(size(r_*)^2)$ membership queries. The procedure Variable_Replacing
uses at most $O(size(r_*))$ membership queries. Thus, the algorithm uses at most
$O(n^2)$ membership queries in total, where $n = \max\{|t|, size(r_*)\}$. □

Next, we show that any term tree in $\mathcal{OTT}^{h}_\Lambda$ is exactly identifiable using
membership queries and one positive example.

Let $t = (V_t, E_t, H_t)$ be a term tree in $\mathcal{OTT}^{h}_\Lambda$ and $e = [v_0, v_1, \ldots, v_\ell]$ $(\ell \ge 2)$
a variable in $H_t$. We define the replacement of $e$ to $t$ as the following operation:
The operation replaces the variable $e$ with variables $[v_0, v_1], \ldots, [v_0, v_\ell]$, that is,
$H_t = H_t \cup \{[v_0, v_1], \ldots, [v_0, v_\ell]\} - \{e\}$. If $e$ has an $i$-height-bounded variable
label, then each variable $[v_0, v_j]$ $(j = 1, \ldots, \ell)$ has an $i$-height-bounded variable
label. We denote by $varRep(t, e)$ the term tree obtained from $t$ by applying the
replacement of $e$ to $t$.

Lemma 2. Let $t = (V_t, E_t, H_t)$ be a term tree in $\mathcal{OTT}^{h}_\Lambda$ and $e$ a 1-height-bounded
variable in $H_t$ which has $k$ child ports, where $k \ge 2$. Then, $L_\Lambda(t) =
L_\Lambda(varRep(t, e))$.

Let $k$ and $\ell$ be positive integers. We say that a tree $t$ is a $(k, \ell)$-broom tree
if $V_t = \{u_0, \ldots, u_{k-1}, v_1, \ldots, v_\ell\}$, $E_t = \{\{u_0, u_1\}, \ldots, \{u_{k-2}, u_{k-1}\}, \{u_{k-1}, v_1\},
\ldots, \{u_{k-1}, v_\ell\}\}$ and $H_t = \emptyset$. For example, in Fig. 4, the tree $t_6$ is a (3,5)-broom tree.
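The vertex and edge sets of a $(k, \ell)$-broom tree can be generated directly from the definition. The following Python sketch uses hypothetical string names `u0, ..., u{k-1}` and `v1, ..., v{l}` for the vertices and returns edges as (parent, child) pairs; edge labels are left out.

```python
def broom_tree(k, l):
    """Return (vertices, edges) of a (k, l)-broom tree.

    The handle is the chain u0, ..., u{k-1}; the l vertices v1, ..., vl
    are all children of the last handle vertex u{k-1}.
    """
    if k < 1 or l < 1:
        raise ValueError("k and l must be positive integers")
    handle = [f"u{i}" for i in range(k)]
    bristles = [f"v{j}" for j in range(1, l + 1)]
    vertices = handle + bristles
    # Chain edges {u0,u1}, ..., {u{k-2},u{k-1}}
    edges = [(handle[i], handle[i + 1]) for i in range(k - 1)]
    # Edges from the end of the handle to every v_j
    edges += [(handle[-1], v) for v in bristles]
    return vertices, edges

vertices, edges = broom_tree(3, 5)
```

For the (3,5)-broom tree of Fig. 4 this yields 8 vertices and 7 edges.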

Theorem 2. The algorithm LEARN_OTT of Fig. 9 exactly identifies any term
tree $r_* \in \mathcal{OTT}^{h}_\Lambda$ in polynomial time using at most $O(n^2)$ membership queries
and one positive example $t$, where $|\Lambda| \ge 2$ and $n = \max\{|t|, size(r_*)\}$.



Proof. (Sketch) By Lemma 2, we pay no attention to variables whose sizes
are 1. Let $e_1^*, e_2^*, \ldots, e_i^*, \ldots$ be the sequence of variables of size
more than 1 in $r_*$ in breadth-first search order. For the term tree $r$ output by
the algorithm LEARN_OTT, let $e_1, e_2, \ldots, e_i, \ldots$ be the sequence of variables

Algorithm LEARN_OTT
Given: The oracle $Mem_{r_*}$ for a target $r_* \in \mathcal{OTT}^{h}_\Lambda$ and one positive example $t$;
Output: A term tree $r \in \mathcal{OTT}^{h}_\Lambda$ with $L_\Lambda(r) = L_\Lambda(r_*)$;

begin
    $s = (V_s, E_s, H_s)$ := LEARN_OTT1($t$);
    $r := s$;
    flag := false;
    Let $E = [v_1, \ldots, v_\ell]$ be the sequence of vertices in $s$
        in breadth-first search order except the root;
    $i := 1$;
    while $E$ is not empty do begin
        $S := [v_i]$;
        repeat
            Let $u_i$ be the parent of $v_i$ and $u_{i+1}$ the parent of $v_{i+1}$;
            if $[u_i, v_i] \notin H_s$ or $[u_{i+1}, v_{i+1}] \notin H_s$ then begin break; end;
            $e_i := [u_i, v_i]$;
            $e_{i+1} := [u_{i+1}, v_{i+1}]$;
            if $v_{i+1}$ is the next sibling of $v_i$ and $size(e_i) = size(e_{i+1}) > 1$ then begin
                $S := S + [v_{i+1}]$;
                $m := size(e_i)$;
                Let $r'$ be a tree obtained from $r$ by replacing variables
                    which consist of vertices in $S$ with an $(m, |S|)$-broom tree,
                    and other variables with the substitutable longest chain trees;
                if $Mem_{r_*}(r')$ = yes then begin
                    flag := false;
                    $i := i + 1$;
                end else begin
                    flag := true; $S := S - [v_{i+1}]$;
                end;
            end else flag := true;
        until flag;
        if $|S| = 1$ then begin Remove the vertices in $S$ from $E$; end
        else begin
            Let $R$ be the sequence of vertices in $r$ which corresponds to the vertices in $S + [u_i]$;
            Replace the variables which consist of vertices in $R$
                with a variable which has $|R| - 1$ child ports;
            The variable has an $m$-height-bounded variable label;
            Remove the vertices in $S$ from $E$;
        end
        $i := i + 1$;
    end;
    output $r$;
end.



Fig. 9. Algorithm LEARN_OTT



of size more than 1 in $r$ in breadth-first search order. We assume
that there exists a positive integer $i$ such that the number of child ports of $e_i^*$
is different from the number of child ports of $e_i$. Let $k$ be the minimum integer
satisfying the above condition, $n_*$ the number of child ports of $e_k^*$ and $n$ the
number of child ports of $e_k$. Let $e_k = [w_0, w_1, \ldots, w_n]$ and $w_{n+1}$ the next sibling
of $w_n$, where $w_0$ is the parent port of $e_k$ and $w_1, \ldots, w_n$ are the child ports of $e_k$.
We assume $n < n_*$. Let $r'$ be the term tree obtained from $r$ by replacing each
variable $e_j$ $(j > k)$ with $\alpha_j$ variables having one child port and an $\ell_j$-height-bounded
variable label, and let $S = [w_1, \ldots, w_n, w_{n+1}]$, where $\alpha_j$ is the number of
child ports of $e_j$ and $\ell_j$ is the size of $e_j$. In the repeat-loop of the algorithm
LEARN_OTT, let $r''$ be a tree obtained from $r'$ by replacing variables which
consist of vertices in $S$ with an $(m, |S|)$-broom tree, and other variables having
$\beta$ $(\ge 1)$ child ports with $\beta$ substitutable longest chain trees, where $m$ is the size
of $e_k$. For the tree $r''$, the membership oracle outputs "no". But the tree $r''$ is
in $L_\Lambda(r_*)$. This is a contradiction. For $n > n_*$, we can show a contradiction in a similar way.
Hence, for any $i \ge 1$, the numbers of child ports of $e_i^*$ and $e_i$ are the same. Thus, the
algorithm outputs a term tree $r$ with $L_\Lambda(r_*) = L_\Lambda(r)$.

Let $t$ be a given positive example. By Theorem 1, the algorithm
LEARN_OTT1 uses at most $O(n^2)$ membership queries, where $n =
\max\{|t|, size(r_*)\}$. In the algorithm LEARN_OTT, the while-loop uses at most
$O(size(r_*))$ membership queries. Therefore, the algorithm uses at most $O(n^2)$
membership queries in total. □



5 Hardness Results on the Learnability 

In this section, we show the insufficiency of learning of $\mathcal{OTT}^{1,h}_\Lambda$ and $\mathcal{OTT}^{h}_\Lambda$ in
the exact learning model. We use the following lemma to show the insufficiency
of learning of $\mathcal{OTT}^{1,h}_\Lambda$ and $\mathcal{OTT}^{h}_\Lambda$.

Lemma 3. (László Lovász [5]) Let $W_n$ be the number of all rooted unordered
unlabeled trees which have exactly $n$ vertices. Then, $2^n < W_n < 4^n$, where $n \ge 6$.

From this lemma, if $|\Lambda| \ge 1$, then the number of unordered trees which have
exactly $n$ vertices is greater than $2^n$. Thus, the number of ordered trees which
have exactly $n$ vertices is greater than $2^n$. The following lemma is known as a tool for showing
the insufficiency of learning in the exact learning model.

Lemma 4. (Angluin [3]) Suppose the hypothesis space contains a class of distinct
sets $L_1, \ldots, L_N$, and there exists a set $L_\cap$ which is not a hypothesis, such
that for any pair of distinct indices $i$ and $j$, $L_\cap = L_i \cap L_j$. Then any algorithm
that exactly identifies each of the hypotheses $L_i$ using equivalence, membership,
and subset queries must make at least $N - 1$ queries in the worst case.
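The counting idea behind this kind of lower bound can be illustrated with a small simulation. The sketch below is not Angluin's proof: it assumes the special case of pairwise disjoint singleton hypotheses (so the common intersection is empty), fixes one particular learner that simply tries the hypotheses in order with equivalence queries, and lets a hypothetical adversary answer every query so that only the queried hypothesis is ruled out.

```python
def adversary_query_count(n_hypotheses):
    """Count the equivalence queries of an enumerating learner against an adversary.

    Hypothesis i stands for the singleton language {i}. The adversary answers
    "no" to the guess {g} and returns g itself as the counterexample, which is
    consistent with every other hypothesis and therefore eliminates only {g}.
    """
    candidates = set(range(n_hypotheses))  # hypotheses still consistent with all answers
    queries = 0
    for guess in range(n_hypotheses):
        if len(candidates) == 1:
            break  # a single hypothesis remains, so the learner can output it
        queries += 1
        candidates.discard(guess)
    return queries
```

With `N` hypotheses this learner is forced to ask `N - 1` queries, matching the bound in the lemma; the lemma itself states that no learner using the three query types can do better in the worst case.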

By Lemmas 3 and 4, we have Theorems 3 and 4.

Theorem 3. Any learning algorithm that exactly identifies all term trees which
have exactly $n$ vertices in $\mathcal{OTT}^{1,h}_\Lambda$ using equivalence, membership and subset
queries must make more than $2^n$ queries in the worst case, where $|\Lambda| \ge 1$ and
$n \ge 6$.







S. Matsumoto and T. Shoudai 



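Table 1. Our results and future works

Class | Exact learning | Inductive inference from positive data
$\mathcal{OTT}_\Lambda$ | Yes [7]: membership & a positive example, $|\Lambda| \ge 2$ | Yes [10,12]: polynomial time, $|\Lambda| \ge 1$
Finite sets of term trees in $\mathcal{OTT}_\Lambda$ | Yes [6]: restricted subset & equivalence, $|\Lambda|$ is infinite | Open
$\mathcal{OTT}^{1,h}_\Lambda$ | Sufficiency [This work]: membership & one positive example, $|\Lambda| \ge 2$. Insufficiency [This work]: membership, equivalence, subset, $|\Lambda| \ge 1$ | A subclass: Yes [11], polynomial time, $|\Lambda| \ge 2$
$\mathcal{OTT}^{h}_\Lambda$ | Sufficiency [This work]: membership & one positive example, $|\Lambda| \ge 2$. Insufficiency [This work]: membership, equivalence, subset, $|\Lambda| \ge 1$ | Open
Finite unions of term trees with height-bounded variables | Open | Open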

Proof. For a tree $r$, we have $L_\Lambda(r) = \{r\}$. Consider the class of singleton
sets $\{r\}$, where $r$ is a tree in $\mathcal{OTT}^{1,h}_\Lambda$ which has exactly $n$ vertices. This class is a subclass
of $\mathcal{OTT}^{1,h}_\Lambda$. For any distinct $L$ and $L'$ in this class, $L \cap L' = \emptyset$. The empty set is not a term tree
language. Thus, by Lemmas 3 and 4, any learning algorithm that exactly identifies
all the term trees in $\mathcal{OTT}^{1,h}_\Lambda$ which have exactly $n$ vertices using equivalence,
membership and subset queries must make $\Omega(2^n)$ queries in the worst case,
where $|\Lambda| \ge 1$ and $n \ge 6$. □

Since Theorem 4 can be proved in a similar way to Theorem 3, we omit its proof.

Theorem 4. Any learning algorithm that exactly identifies all term trees in
$\mathcal{OTT}^{h}_\Lambda$ which have exactly $n$ vertices using equivalence, membership and subset
queries must make more than $2^n$ queries in the worst case, where $|\Lambda| \ge 1$ and
$n \ge 6$.

6 Conclusions 

We have discussed the learnabilities of $\mathcal{OTT}^{1,h}_\Lambda$ and $\mathcal{OTT}^{h}_\Lambda$ in the exact learning
model. In Section 4, we have shown that any term tree $r_*$ in $\mathcal{OTT}^{1,h}_\Lambda$ and $\mathcal{OTT}^{h}_\Lambda$
is exactly identifiable using at most $O(n^2)$ membership queries and one positive
example $t$, where $|\Lambda| \ge 2$ and $n = \max\{|t|, size(r_*)\}$. In Section 5, we have
shown the insufficiency of learning of $\mathcal{OTT}^{1,h}_\Lambda$ and $\mathcal{OTT}^{h}_\Lambda$ in the exact learning
model.

As future work, we will study the learnability of finite unions of term trees
with $i$-height-bounded variables for any $i \ge 1$ in the exact
learning model. We can define width-constrained variables in a similar way to
height-constrained variables. We will also study tree structured patterns which
can give valuable knowledge, by combining constraints on the height and width
of trees. We summarize our results and future work in Table 1. We denote by
$\mathcal{OTT}_\Lambda$ the set of all term trees in which no variable has a height-bounded variable
label. Table 1 also refers to the class of all finite sets of term trees in which no variable has
a height-bounded variable label.

Abstract. We investigate exact learning of regular tree languages from
positive examples and membership queries. Input data are trees of the
language to infer. The learner computes new trees from the inputs and
asks the oracle whether or not they belong to the language. From the
answers, the learner may ask further membership queries until he finds
the correct grammar that generates the target language. This paradigm
was introduced by Angluin in the seminal work [1] for the case of regular
word languages. Neither negative examples, equivalence queries nor
counterexamples are allowed in this paradigm.

We describe an efficient algorithm which is polynomial in the size of the
examples for learning the whole class of regular tree languages. The
convergence is ensured when the set of examples contains a representative
sample of the language to guess. A finite subset $E$ of a regular tree
language $L$ is representative for $L$ if every transition of the minimal tree
automaton for $L$ is used at least once for the derivation of an element of
the set $E$.



1 Introduction 

1.1 Some Linguistic Motivations 

The most astonishing discovery of Chomsky is the universal grammar, which is
a model of how human language works. The universal grammar is an innate
combinatorial system from which every language (French, English, Japanese)
can be derived. What is the implication of Chomsky’s universal grammar for
grammatical inference? We think it gives a strong intuition on the mathematical
modeling of language learning. Before going further, let us focus on the linguistic
aspect of language acquisition processes.

Several recent works of psycho-linguists like Pinker [20] or Christophe [9]
advocate that the universal grammar plays the role of a learning device for
children. A child is able to determine whether or not a sentence is grammatically
correct, even without knowing the meaning of each word. Of course, semantics
speeds up the learning process, but it is not necessary. And it is fascinating
to see that a child needs only little information in order to learn a language
(poverty-of-stimulus hypothesis).

Another important feature is our capacity to guess mental tree-structured
representations of phrases. How a child is able to do that is beyond the scope of
this paper. However, the child language acquisition process is not based on the
construction of a huge finite automaton with probabilistic transitions, because
there are infinitely many valid sentences and we have the ability to generate them.

To sum up this brief discussion, we take as a hypothesis that a child computes
a grammar from tree-structured sentences.



1.2 A Mathematical Model 

Can we give a mathematical model of language acquisition which corroborates
this theory? The grammatical inference paradigm of Gold [16] is a good
candidate. Indeed, the inputs of the learning process are just examples of the
target language, and there is no interaction with the environment. But this
paradigm is too weak to be plausible. For this reason, we add to Gold's paradigm
an oracle which answers membership queries. Hence, the grammatical
inference is based on positive examples and membership queries about computed
elements, as introduced by Angluin in [1]. This learning model agrees with the
poverty-of-stimulus hypothesis. Indeed, the interaction with the environment is
very weak. For example, a child asks something, but nobody understands. From
this lack of reaction from the environment, he may deduce that the sentence
is wrong, and so not in the language. We insist on the fact that membership
queries are the minimal information which can be inferred from a dialog.

On the other hand, negative examples are not necessary because a parent
does not say incorrect sentences to a child. One might think about other kinds
of queries, like equivalence queries as suggested by Angluin [2]. But they are
not necessary, as we shall see, and they are unrealistic in a linguistic context. In
conclusion, our learning model seems quite adequate with respect to our initial
motivation.

Now that we have set the learning paradigm, we have to say what kind of
languages are targeted. As we have said, we can assume that a child has a kind
of parser which transforms a (linear) sentence into a tree representation. So,
we learn tree languages. Regular tree languages are the backbone of several
linguistic formalisms like classical categorial grammars, for which learnability
has been studied in [18,7], dependency languages [11,4,6] or TAG derivations.



1.3 The Results 

We establish that the whole set of regular tree languages is efficiently identifiable
from membership queries and positive examples. The running time of the learning
algorithm is polynomial in the size of the input examples. Efficiency is a
necessary property of our model. The difficulty is to construct trees for
membership queries that give useful information to proceed in the inference
process. As far as we know, this result is new.

1.4 A Web Application

There are other applications of our result. For example, an XML document is
a tree, if we forget links. The style-sheet defines a regular tree grammar. Now,
say that we try to determine a Document Type Definition (the grammar which
generates XML documents). For this, we can read correct XML documents from
a server, which form a set of positive examples. Then, we can build an XML
document and make a membership query by sending it to the server. If no error
occurs then the document is in the language, otherwise it is not.

1.5 Related Works 

Learning from positive examples and membership queries. Angluin considers
the same learning paradigm in [1] for the class of regular word languages.
The notion of observation table, which is defined in [2], is already implicitly used
in [1]. However, we cannot extend the algorithm of [1] in a straightforward way,
as explained by Sakakibara in [22]. In Section 4.3, we shall compare
these works with our approach more precisely.

Other paradigms. Sakakibara studied grammatical inference of languages of
unlabeled derivation trees of context free grammars. In [23], he extends the
result of [2] by learning with membership queries and equivalence queries. The
possibility of asking the teacher whether a calculated hypothesis corresponds to
the target language does not seem relevant for the aim of constructing a model
of natural language processes. In [22,21] Sakakibara uses positive and negative
examples with membership queries. At the end of [21], it is claimed (without
proof) that the second algorithm runs in polynomial time, but the main one is
exponential in the arity of the function symbols. Compared with [21], we have shown
that negative examples are not necessary and that the running time of our
learning algorithm is polynomial. It is worth noticing that the set of all unlabeled
context free derivation tree languages is a strict subclass of the set of regular
tree languages. Therefore, we learn more and in a weaker setting, since negative
examples are not necessary in our work. This is important when we are learning
from structured examples. Indeed, the language of structured examples may not
come from a context free grammar while the word language is still context free.

Inference of regular tree languages from positive examples only has been
studied in [15,17]. In [5] and [14], learnable subclasses have been defined, and in
[8], learning is studied from a stochastic point of view.

Inference of regular tree languages with queries has been studied in [12]; the
learning algorithm is based on membership queries and equivalence queries, and
this result constitutes an extension of Sakakibara’s works. In [13], a polynomial
version of the former learning algorithm has been developed.

2 Regular Tree Languages 

We present the general definitions of regular tree languages based on the Tata
book [10]. A ranked alphabet $V$ is a finite set of symbols with a function arity
from $V$ to $\mathbb{N}$, which indicates the arity of a symbol. The set $T(V)$ of terms is
inductively defined as follows. A symbol of arity 0 is in $T(V)$, and if $f$ is a symbol
of arity $n$ and $t_1, \ldots, t_n$ are in $T(V)$, then $f(t_1, \ldots, t_n)$ is in $T(V)$. Throughout,
labeled ordered trees are represented by terms.

Subterms of a term $t$ are defined as follows: $t$ is a subterm of $t$, and if $f(t_1, \ldots, t_n)$
is a subterm of $t$, then $t_1, \ldots, t_n$ are subterms of $t$. For a set of terms $E$, $S(E)$ is
the set of subterms of elements of $E$.

A context is a term c[o] containing a special variable o which has only one 
occurrence. The variable o marks an empty place in a term. In particular, o is a 
context called the empty context. The substitution of o by a term s is noted c[s]. 
We write E[o\ for the set of contexts obtained by replacing exactly one occurrence 
of a £l-subterm by o (f[o] contains the empty context o). 
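The definitions of terms, subterms, contexts and substitution can be made concrete with a small sketch. It assumes a representation that is not part of the paper: a term is a nested Python tuple whose first component is the symbol, and the name `HOLE` is a hypothetical marker for the context variable.

```python
# Terms are nested tuples: (symbol, child_1, ..., child_n).
HOLE = ("<hole>",)  # hypothetical marker standing for the context variable


def subterms(t):
    """List the distinct subterms of t, root first."""
    found = []

    def walk(u):
        if u not in found:
            found.append(u)
        for child in u[1:]:
            walk(child)

    walk(t)
    return found


def contexts(t):
    """List every context obtained by replacing one subterm occurrence of t by HOLE."""
    result = [HOLE]
    for i in range(1, len(t)):
        for ctx in contexts(t[i]):
            result.append(t[:i] + (ctx,) + t[i + 1:])
    return result


def plug(ctx, s):
    """Return ctx[s]: the context ctx with its hole replaced by the term s."""
    if ctx == HOLE:
        return s
    return (ctx[0],) + tuple(plug(child, s) for child in ctx[1:])


b = ("b",)
c = ("c",)
tree = ("a", ("b", b), ("c", ("c", c)))  # the term a(b(b), c(c(c)))
print(len(subterms(tree)), len(contexts(tree)))  # prints: 6 6
```

Plugging each subterm back into the context that was cut out around it returns the original term, which is a quick sanity check on both functions.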

A bottom-up non-deterministic finite tree automaton (NFTA) is a quadruple
$A = (V, Q, Q_F, \to_A)$ where $V$ is a ranked alphabet, $Q$ is a finite set of
states, $Q_F \subseteq Q$ is the set of final states, and $\to_A$ is the set of
transitions. A transition is a rewrite rule of the form
$f(q_1, \ldots, q_n) \to_A q$ where $q$ and $q_1, \ldots, q_n$ are states of $Q$,
and $f$ is a symbol of arity $n$. In particular, a transition may be just of the
form $a \to_A q$ where $a$ is a symbol of arity 0.

The single derivation relation $\to_A$ is defined so that $t \to_A s$ if and only
if there is a transition $f(q_1, \ldots, q_n) \to_A q$ such that for a context
$u[\diamond]$, $t = u[f(q_1, \ldots, q_n)]$ and $s = u[q]$. The derivation
relation $\to_A^*$ is the reflexive and transitive closure of $\to_A$.

A tree language $\mathcal{L}$ is recognized by $A$ if $\mathcal{L} = \mathcal{L}_A$
where

$$\mathcal{L}_A = \{t \in T(V) : t \to_A^* q_f \text{ and } q_f \in Q_F\}$$

A tree language is regular if and only if it is recognized by a NFTA.

A state $q$ is reached by a term $t$ if $t \to_A^* q$. A state $q$ accepts a
context $c[\diamond]$ if $c[q] \to_A^* q_f$ with $q_f \in Q_F$. A NFTA is trimmed
if and only if each state is reached by a term and accepts at least one context.
In other words, there is no useless transition and no useless state. Notice that
a trim automaton has a partial transition relation.

A finite tree automaton is deterministic (DFTA) if there are no two rules with
the same left-hand side. It is well known that NFTA and DFTA recognize the same
class of languages. Therefore, each regular language is recognized by a trim DFTA.

The Myhill-Nerode congruence $\equiv_{\mathcal{L}}$ of a tree language
$\mathcal{L}$ is defined by $t \equiv_{\mathcal{L}} s$ iff for every context
$c[\diamond]$, $c[t] \in \mathcal{L}$ iff $c[s] \in \mathcal{L}$. Kozen [19] gives
an elementary proof of the Myhill-Nerode Theorem and tells the story behind it.
The Myhill-Nerode congruence $\equiv_{\mathcal{L}}$ defines the minimal automaton
for $\mathcal{L}$ (up to a renaming of states) and conversely. The minimal
automaton of $\mathcal{L}$, noted $A_{\mathcal{L}}$, is defined as follows:

- the state set $Q_{A_{\mathcal{L}}}$ is the set of non-empty equivalence classes
  of the Myhill-Nerode congruence,
- the set of final states $Q_{F,A_{\mathcal{L}}}$ is the set of states $[t]$ such
  that $[t]$ is contained in $\mathcal{L}$,
- the transition relation $\to_{A_{\mathcal{L}}}$ is the smallest relation such
  that $f([t_1], \ldots, [t_n]) \to_{A_{\mathcal{L}}} [f(t_1, \ldots, t_n)]$.

Throughout, we write $[t]$ for the equivalence class of $t$ with respect to
$\equiv_{\mathcal{L}}$.

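Example 1. Consider the DFTA
$A = (\{a, b, c\}, \{q_F, q_1, q_2, q_3, q_4\}, \{q_F\}, \to_A)$, where $\to_A$ is
the following set of transitions:

$$
\begin{array}{lll}
a(q_2, q_3) \to_A q_F & & \\
b(q_2) \to_A q_1 & \quad & c(q_4) \to_A q_3 \\
b(q_1) \to_A q_2 & \quad & c(q_3) \to_A q_4 \\
b \to_A q_1 & \quad & c \to_A q_3
\end{array}
$$

The tree $a(b(b), c(c(c)))$ belongs to $\mathcal{L}_A$. Indeed we have:

$$
\begin{array}{l}
a(b(b), c(c(c))) \to_A a(b(q_1), c(c(c))) \to_A a(q_2, c(c(c))) \\
\quad \to_A a(q_2, c(c(q_3))) \to_A a(q_2, c(q_4)) \to_A a(q_2, q_3) \to_A q_F
\end{array}
$$

$\mathcal{L}_A$ is the tree language $\{a(b^{2n+2}, c^{2m+1}) : n, m \in \mathbb{N}\}$.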
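The derivation of Example 1 can be replayed mechanically by evaluating a term bottom-up. The sketch below is only an illustration: terms are assumed to be nested tuples `(symbol, child_1, ..., child_n)`, the state names are plain strings, and the leaf transitions for `b` and `c` are the ones used in the first steps of the derivation above.

```python
# Transitions of the DFTA of Example 1: (symbol, child states) -> state.
DELTA = {
    ("a", ("q2", "q3")): "qF",
    ("b", ()): "q1",
    ("b", ("q1",)): "q2",
    ("b", ("q2",)): "q1",
    ("c", ()): "q3",
    ("c", ("q3",)): "q4",
    ("c", ("q4",)): "q3",
}
FINAL = {"qF"}


def run(term):
    """Return the state reached by term, or None if no transition applies."""
    children = tuple(run(child) for child in term[1:])
    if None in children:
        return None
    return DELTA.get((term[0], children))


def accepts(term):
    """A term is accepted when it reaches a final state."""
    return run(term) in FINAL


def chain(symbol, k):
    """Build symbol nested k times, e.g. chain('b', 2) is b(b)."""
    t = (symbol,)
    for _ in range(k - 1):
        t = (symbol, t)
    return t


print(accepts(("a", chain("b", 2), chain("c", 3))))  # a(b(b), c(c(c))) -> True
print(accepts(("a", chain("b", 3), chain("c", 3))))  # odd number of b's -> False
```

Because the automaton is deterministic, a dictionary lookup per node is enough; for a NFTA the same traversal would have to carry sets of states.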

An automaton homomorphism between the NFTA $A = (V, Q, Q_F, \to_A)$ and
$A' = (V', Q', Q'_F, \to_{A'})$ is a mapping $\phi$ from $Q$ to $Q'$ such that:

- for each transition $f(q_1, \ldots, q_n) \to_A q$ of $A$,
  $f(\phi(q_1), \ldots, \phi(q_n)) \to_{A'} \phi(q)$ is a transition of $A'$,
- $\phi(Q_F) \subseteq Q'_F$.

This implies that $\mathcal{L}_A \subseteq \mathcal{L}_{A'}$. If $\phi$ is
bijective, it consists in a renaming of the states and
$\mathcal{L}_A = \mathcal{L}_{A'}$.

3 Learning Regular Tree Languages 

3.1 The Learning Paradigm 

The goal is the identification of any unknown regular tree language $\mathcal{L}$
with the help of a teacher. The teacher is an oracle which answers membership
queries. The learning process begins with a set of positive examples. Then a
dialogue is established between the learner and the teacher. The learner asks
whether or not a new tree belongs to the unknown language. The teacher answers
"yes" or "no" to this query. This learning process halts after a finite number
of queries.

We shall provide a necessary and sufficient condition to guess the unknown
language and so to construct a DFTA which recognizes it.

3.2 Representative Samples

Informally, a representative sample of a language $\mathcal{L}$ is a finite subset
$\mathcal{E}$ such that each transition of the minimal automaton is used to
produce a term of $\mathcal{E}$.

The set $\mathcal{E}$ is a representative sample of $\mathcal{L}$ if for each
transition $f(q_1, \ldots, q_n) \to_{A_{\mathcal{L}}} q$, there is a term
$f(t_1, \ldots, t_n)$ in $S(\mathcal{E})$ such that
$\forall\, 1 \le i \le n$, $t_i \to_{A_{\mathcal{L}}}^* q_i$.

Am 

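Example 2. This example illustrates the fact that several distinct languages can
have identical representative samples. Let $A$ be the DFTA defined in Example 1
and $A'$ the DFTA defined by:

$$
\begin{array}{lll}
a(q_1, q_2) \to_{A'} q_F & \quad b \to_{A'} q_1 & \quad c \to_{A'} q_2 \\
 & \quad b(q_1) \to_{A'} q_1 & \quad c(q_2) \to_{A'} q_2
\end{array}
$$

where $q_F$ is the unique final state of $A'$. From this definition, we have

$$\mathcal{L}_{A'} = \{a(b^n, c^m) : n, m \in \mathbb{N}^*\}$$

and the singleton set

$$\{a(b(b), c(c(c)))\}$$

is a representative sample for both $\mathcal{L}_A$ and $\mathcal{L}_{A'}$.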



Remark 1. The size of a minimal representative sample relative to the size of
the minimal automaton $A_{\mathcal{L}}$ of a language $\mathcal{L}$ deserves
discussion. Contrary to the case of word languages, there is no polynomial
relation between the size (the total number of nodes) of a minimal characteristic
sample and the size (the number of states) of the minimal DFTA. For instance,
for two integers $n$ and $m$, the singleton language containing the complete tree
with $n$ levels of $a$-nodes above leaves labeled $b$, where the arity of $a$ is
$m$ and the arity of $b$ is 0, is a regular tree language recognized by the
following minimal DFTA:

$$
\begin{array}{l}
a(q_n, \ldots, q_n) \to q_F \\
\quad \vdots \\
a(q_1, \ldots, q_1) \to q_2 \\
b \to q_1
\end{array}
$$

The size of the representative sample, which is the singleton language above, is
$\frac{m^{n+1} - 1}{m - 1}$.

3.3 Observation Tables



Following the method of Angluin [3], information obtained from the queries is
stored in a table. Let $\mathcal{L}$ be a tree language, $\mathcal{E}$ be a finite
set of terms and $\mathcal{F}$ be a finite set of contexts. The observation table
$T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$ is the table defined by:

- rows are labeled with terms of $S(\mathcal{E})$,
- columns are labeled with contexts of $\mathcal{F}$,
- cells $T_{\mathcal{L}}(t, c[\diamond])$ are labeled with 1 or 0 in such a way
  that

$$
T_{\mathcal{L}}(t, c[\diamond]) =
\begin{cases}
1 & \text{if } c[t] \in \mathcal{L} \\
0 & \text{otherwise}
\end{cases}
$$

We call $row(t)$ the binary word obtained by reading from left to right the row
labeled by $t$ in $T$.

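Example 3. Consider the tree language $\mathcal{L}_A$ defined in Example 1,
$\mathcal{E}$ the set of terms:

$$\mathcal{E} = \{a(b(b), c(c(c)))\}$$

and $\mathcal{F}$ the set of contexts:

$$\mathcal{F} = \mathcal{E}[\diamond] = \{\diamond,\ a(\diamond, c(c(c))),\ a(b(\diamond), c(c(c))),\ a(b(b), \diamond),\ a(b(b), c(\diamond)),\ a(b(b), c(c(\diamond)))\}$$

The corresponding observation table $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$
is the following:

$$
\begin{array}{l|cccccc}
 & \diamond & a(\diamond, c(c(c))) & a(b(\diamond), c(c(c))) & a(b(b), \diamond) & a(b(b), c(\diamond)) & a(b(b), c(c(\diamond))) \\
\hline
a(b(b), c(c(c))) & 1 & 0 & 0 & 0 & 0 & 0 \\
b(b) & 0 & 1 & 0 & 0 & 0 & 0 \\
b & 0 & 0 & 1 & 0 & 0 & 0 \\
c(c(c)) & 0 & 0 & 0 & 1 & 0 & 1 \\
c(c) & 0 & 0 & 0 & 0 & 1 & 0 \\
c & 0 & 0 & 0 & 1 & 0 & 1
\end{array}
$$
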

An observation table $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$ defines a NFTA
$A_T = (V_T, Q_T, Q_{F,T}, \to_{A_T})$ where:

- $V_T$ is the set of symbols occurring in $\mathcal{E}$,
- $Q_T = \{row(t),\ t \in S(\mathcal{E})\}$,
- $Q_{F,T} = \{row(t),\ t \in \mathcal{E}\}$,
- the transition relation $\to_{A_T}$ is the smallest relation such that
  $f(row(t_1), \ldots, row(t_n)) \to_{A_T} row(f(t_1, \ldots, t_n))$
  for each $f(t_1, \ldots, t_n) \in S(\mathcal{E})$.

Example 4. The table $T$ of Example 3 defines the NFTA $A_T$ as:

- $V_T = \{a, b, c\}$
- $Q_T = \{100000, 010000, 001000, 000101, 000010\}$
- $Q_{F,T} = \{100000\}$
- $\to_{A_T}$ is the following set of transitions

$$
\begin{array}{lll}
a(010000, 000101) \to_{A_T} 100000 & & \\
b(010000) \to_{A_T} 001000 & \quad & c(000010) \to_{A_T} 000101 \\
b(001000) \to_{A_T} 010000 & \quad & c(000101) \to_{A_T} 000010 \\
b \to_{A_T} 001000 & \quad & c \to_{A_T} 000101
\end{array}
$$

We remark that $A_T = \phi(A)$, where $A$ is the DFTA introduced in Example 1 and
$\phi$ is the renaming defined by: $\phi(q_F) = 100000$, $\phi(q_1) = 001000$,
$\phi(q_2) = 010000$, $\phi(q_3) = 000101$, $\phi(q_4) = 000010$.
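The construction of an observation table and of the automaton it defines can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: terms are nested tuples `(symbol, child_1, ..., child_n)`, `HOLE` is a hypothetical marker for the context variable, and the teacher is simulated by a hand-written membership function `member` for the language of Example 1.

```python
HOLE = ("<hole>",)  # hypothetical marker for the context variable


def subterms(t):
    yield t
    for child in t[1:]:
        yield from subterms(child)


def contexts(t):
    yield HOLE
    for i in range(1, len(t)):
        for ctx in contexts(t[i]):
            yield t[:i] + (ctx,) + t[i + 1:]


def plug(ctx, s):
    if ctx == HOLE:
        return s
    return (ctx[0],) + tuple(plug(child, s) for child in ctx[1:])


def chain_length(t, symbol):
    """Return k if t is symbol nested k times (b(b(b)) gives 3), else None."""
    k = 1
    while t[0] == symbol and len(t) == 2:
        t, k = t[1], k + 1
    return k if t == (symbol,) else None


def member(t):
    """Simulated teacher for {a(b^(2n+2), c^(2m+1))}, the language of Example 1."""
    if t[0] != "a" or len(t) != 3:
        return False
    nb, nc = chain_length(t[1], "b"), chain_length(t[2], "c")
    return nb is not None and nc is not None and nb >= 2 and nb % 2 == 0 and nc % 2 == 1


def observation_table(sample, ctxs, member):
    """Map each subterm of the sample to its row, written as a 0/1 string."""
    return {
        t: "".join("1" if member(plug(ctx, t)) else "0" for ctx in ctxs)
        for e in sample
        for t in subterms(e)
    }


def table_automaton(sample, rows):
    """Build (states, final states, transitions) from the rows of the table."""
    states = set(rows.values())
    final = {rows[e] for e in sample}
    transitions = {
        ((t[0],) + tuple(rows[child] for child in t[1:]), rows[t]) for t in rows
    }
    return states, final, transitions


sample = [("a", ("b", ("b",)), ("c", ("c", ("c",))))]
ctxs = list(dict.fromkeys(ctx for e in sample for ctx in contexts(e)))
rows = observation_table(sample, ctxs, member)
states, final, transitions = table_automaton(sample, rows)
for t, r in rows.items():
    print(r, t)
```

The rows printed for this input are those of the table in Example 3. The sketch derives transitions strictly from the subterms present in the sample, so it only produces a rule for a left-hand side that some subterm actually exhibits.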



Remark 2. In general, $A_T$ is not complete and is non-deterministic.

An observation table $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$ is said to be
consistent if for any terms $f(t_1, \ldots, t_n)$ and $f(t'_1, \ldots, t'_n)$ in
$S(\mathcal{E})$, if $\forall\, 1 \le j \le n$ we have

$$row(t_j) = row(t'_j)$$

then

$$row(f(t_1, \ldots, t_n)) = row(f(t'_1, \ldots, t'_n))$$

Lemma 1. $A_T$ is deterministic if and only if the table $T$ is consistent.

Proof. This equivalence follows directly from the definitions of the consistency
of $T$ and of $A_T$.
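The consistency condition amounts to checking that no two terms with the same root symbol and the same argument rows have different rows, which is exactly the determinism of the table automaton. A minimal sketch, assuming that `rows` is a dictionary mapping each term (a nested tuple `(symbol, child_1, ..., child_n)`) of a subterm-closed set to its row:

```python
def is_consistent(rows):
    """Check consistency of a table given as a dict from terms to rows."""
    seen = {}
    for t, r in rows.items():
        # Left-hand side of the transition induced by t in the table automaton.
        left_hand_side = (t[0],) + tuple(rows[child] for child in t[1:])
        if seen.setdefault(left_hand_side, r) != r:
            return False  # same left-hand side, two different right-hand sides
    return True


g, h = ("g",), ("h",)

# g and h share a row and so do e(g) and e(h): the table is consistent.
print(is_consistent({g: "0", h: "0", ("e", g): "1", ("e", h): "1"}))  # True

# g and h share a row but e(g) and e(h) do not: the table is not consistent.
print(is_consistent({g: "0", h: "0", ("e", g): "1", ("e", h): "0"}))  # False
```

The rows in the two calls are made-up values chosen only to exercise both outcomes.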



Lemma 2. If $\mathcal{E}$ is a representative sample for a regular tree language
$\mathcal{L}$, $\mathcal{F}$ any finite set of contexts and $T$ the observation
table $T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$, then we have
$\mathcal{L} \subseteq \mathcal{L}_{A_T}$.

Proof. We exhibit an automaton homomorphism $\phi$ from the minimal automaton
$A_{\mathcal{L}}$ onto $A_T$. For each subterm $t$ of $\mathcal{E}$, we set
$\phi([t]) = row(t)$. It is clear that if $[t] = [t']$ then we have
$row(t) = row(t')$ and so $\phi$ is well defined. Since $\mathcal{E}$ is a
representative sample, the range of $\phi$ is $Q_T$. Finally, for each transition
$f(q_1, \ldots, q_n) \to_{A_{\mathcal{L}}} q$, we have
$f(\phi(q_1), \ldots, \phi(q_n)) \to_{A_T} \phi(q)$.

Remark 3. The number of states of the automaton $A_T$ is lower than or equal to
the number of states of $A_{\mathcal{L}}$.

Remark 4. If the automaton homomorphism $\phi$ defined in the proof of Lemma 2
is bijective, then $\mathcal{L} = \mathcal{L}_{A_T}$. Consequently, if
$\mathcal{L} \neq \mathcal{L}_{A_T}$ then there are two subterms $t$ and $t'$ of
$\mathcal{E}$ such that $row(t) = row(t')$ and $[t] \neq [t']$.

The next lemma is central for the construction of an algorithm that solves
the learning problem. The consistency of the table constructed from a
representative sample $\mathcal{E}$ implies that the corresponding automaton is
minimal. So when the table is consistent, $\mathcal{L}$ is learned, provided that
the input is representative.

In other words, when the table is not consistent, there are two subterms $t$
and $t'$ of $\mathcal{E}$ for which the rows are equal in the table, but
$[t] \neq [t']$.

Lemma 3. Let $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$ be an observation
table where $\mathcal{L}$ is a regular tree language, $\mathcal{E}$ a
representative sample for $\mathcal{L}$, and $\mathcal{F}$ a set of contexts
containing $\mathcal{E}[\diamond]$. If $\mathcal{L} \neq \mathcal{L}_{A_T}$ then
there are two terms

$$f(t_1, \ldots, t_n) \text{ and } f(t'_1, \ldots, t'_n)$$

in $S(\mathcal{E})$ such that there is an index $i$ satisfying

- $row(f(t_1, \ldots, t_{i-1}, t_i, t_{i+1}, \ldots, t_n)) \neq row(f(t'_1, \ldots, t'_{i-1}, t'_i, t'_{i+1}, \ldots, t'_n))$,
- $[t_j] = [t'_j]$, $\forall\, 1 \le j \neq i \le n$,
- $row(t_i) = row(t'_i)$.

The proof of Lemma 3 will appear in the full version of the paper. The first
consequence is algorithmic. Indeed, Lemma 3 implies that a context $c[\diamond]$
can be constructed in order to separate $t_i$ and $t'_i$. By adding $c[\diamond]$
to the table, the corresponding rows of $t_i$ and $t'_i$ become different. Such a
context is called a separating context. Moreover, this separating context is
obtained by plugging $f(t_1, \ldots, t_{i-1}, \diamond, t_{i+1}, \ldots, t_n)$
into a context of $\mathcal{F}$. This observation drastically restricts the
number of separating contexts to consider, which leads to a polynomial time
learning algorithm.

The second consequence is that the learning problem is solved when the table
is consistent, as we have already claimed.

Lemma 4. Let $\mathcal{L}$ be a regular tree language, $\mathcal{E}$ a
representative sample for $\mathcal{L}$ and $\mathcal{F}$ a set of contexts
including $\mathcal{E}[\diamond]$. If the table
$T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$ is consistent, then
$\mathcal{L}_{A_T} = \mathcal{L}$.

Proof. Lemma 3 means that if a table is consistent then two terms that are
equivalent with respect to the Myhill-Nerode congruence have the same row. From
this, we define an automaton homomorphism from $A_T$ onto $A_{\mathcal{L}}$. So,
$\mathcal{L}_{A_T} \subseteq \mathcal{L}$. The conclusion follows by Lemma 2.

4 Identification of Regular Tree Languages

4.1 The Algorithm Altex

The algorithm Altex is defined in Figure 1. Altex first receives a finite subset
$\mathcal{E}$ of an unknown language $\mathcal{L}$. Altex constructs the first
observation table $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{E}[\diamond])$ with
the help of queries. Then, it checks the consistency of the table. Each time
Altex finds the table non-consistent, new contexts constructed from the terms
that contradict the consistency are added to the table, which is then completed
with queries. The process stops when the table is consistent and the automaton
$A_T$ is output.

Input: a finite set of terms $\mathcal{E}$
Initialization: $\mathcal{F} = \mathcal{E}[\diamond]$;
Construct the table $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$;
while there are $f(t_1, \ldots, t_n)$ and $f(t'_1, \ldots, t'_n)$ in $S(\mathcal{E})$ such that
      $row(f(t_1, \ldots, t_n)) \neq row(f(t'_1, \ldots, t'_n))$
      and $\forall\, 1 \le i \le n$, $row(t_i) = row(t'_i)$ do
    Find a context $c[\diamond]$ in $\mathcal{F}$ such that
      $c[f(t_1, \ldots, t_n)] \in \mathcal{L}$ and $c[f(t'_1, \ldots, t'_n)] \notin \mathcal{L}$;
    $\mathcal{F} = \mathcal{F} \cup \{c[f(t_1, \ldots, t_{i-1}, \diamond, t_{i+1}, \ldots, t_n)],\ 1 \le i \le n\}$;
    Construct $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$;
end while;
Return the automaton $A_T$.

Fig. 1. The learning algorithm Altex
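The loop of Figure 1 can be sketched in a few lines. The sketch below is an illustration under several assumptions that are not in the paper: terms are nested tuples `(symbol, child_1, ..., child_n)`, `HOLE` is a hypothetical marker for the context variable, the teacher is any Python predicate `member`, each pass handles every conflicting pair and every separating column at once, and a guard raises an error if an inconsistent table yields no new context. The test data is the finite language of Section 5 together with the sample used there.

```python
HOLE = ("<hole>",)  # hypothetical marker for the context variable


def subterms(t):
    yield t
    for child in t[1:]:
        yield from subterms(child)


def contexts(t):
    yield HOLE
    for i in range(1, len(t)):
        for ctx in contexts(t[i]):
            yield t[:i] + (ctx,) + t[i + 1:]


def plug(ctx, s):
    if ctx == HOLE:
        return s
    return (ctx[0],) + tuple(plug(child, s) for child in ctx[1:])


def altex(sample, member):
    """Sketch of Figure 1; returns (states, final, transitions, contexts)."""
    terms = list(dict.fromkeys(u for e in sample for u in subterms(e)))
    ctxs = list(dict.fromkeys(ctx for e in sample for ctx in contexts(e)))
    while True:
        rows = {t: tuple(member(plug(ctx, t)) for ctx in ctxs) for t in terms}
        conflict, new = False, []
        for t in terms:
            for u in terms:
                if t[0] != u[0] or len(t) != len(u) or rows[t] == rows[u]:
                    continue
                if any(rows[x] != rows[y] for x, y in zip(t[1:], u[1:])):
                    continue
                conflict = True  # equal argument rows, different result rows
                for k, ctx in enumerate(ctxs):
                    if rows[t][k] and not rows[u][k]:  # ctx[t] in L, ctx[u] not in L
                        for i in range(1, len(t)):
                            new.append(plug(ctx, t[:i] + (HOLE,) + t[i + 1:]))
        if not conflict:
            break
        new = [ctx for ctx in dict.fromkeys(new) if ctx not in ctxs]
        if not new:  # safety guard added for this sketch, not part of Figure 1
            raise ValueError("inconsistent table but no new context found")
        ctxs.extend(new)
    states = set(rows.values())
    final = {rows[e] for e in sample}
    transitions = {((t[0],) + tuple(rows[x] for x in t[1:]), rows[t]) for t in terms}
    return states, final, transitions, ctxs


def T(symbol, *children):
    return (symbol,) + children


G, H, I, J, C = T("g"), T("h"), T("i"), T("j"), T("c")


def d(x, y):
    return T("d", T("e", x), T("f", y))


pairs = [(G, I), (H, I), (G, J), (H, J)]
language = {T("a", C), T("b", C)}
language |= {T("a", d(x, y)) for x, y in pairs}
language |= {T("b", d(x, y)) for x, y in pairs if (x, y) != (H, J)}
sample = [T("b", C)] + [T("a", d(x, y)) for x, y in pairs]

states, final, transitions, ctxs = altex(sample, lambda t: t in language)
print(len(states), len(ctxs))  # prints: 11 17
```

On this input the sketch needs two rounds of new contexts, four in the first and two in the second, and ends with eleven states, the size of the minimal automaton given in Section 5.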



4.2 Correctness and Termination 

From the definition of consistency, Altex can easily verify whether the table
$T$ constructed with the help of membership queries is consistent or not. In the
case where $T$ is not consistent, the problem is that the algorithm has to find
by itself (no counterexample is allowed) a new context that will separate two
equal rows. The key point is that the representativeness of the input set of
terms $\mathcal{E}$ makes it possible to determine such a separating context.

Lemma 5. Assume that the input of Altex is a representative sample. If the
table $T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$ is not consistent, Altex
calculates a separating context.

Proof. Altex collects every pair of terms $f(t_1, \ldots, t_n)$ and
$f(t'_1, \ldots, t'_n)$ in $S(\mathcal{E})$ such that there is a context
$c[\diamond]$ in $\mathcal{F}$ with

- $\forall\, 1 \le i \le n$, $row(t_i) = row(t'_i)$,
- $c[f(t_1, \ldots, t_n)] \in \mathcal{L}$,
- $c[f(t'_1, \ldots, t'_n)] \notin \mathcal{L}$.

Among the pairs gathered as above, Lemma 3 states that there is an index $i$
such that

$$\forall\, 1 \le j \neq i \le n,\ [t_j] = [t'_j]$$

and

$$[t_i] \neq [t'_i]$$

So, $c[f(t_1, \ldots, t_{i-1}, \diamond, t_{i+1}, \ldots, t_n)]$ is a separating
context which is added to $\mathcal{F}$. The rows of $t_i$ and $t'_i$ are now
different in

$$T_{\mathcal{L}}(\mathcal{E}, \mathcal{F} \cup \{c[f(t_1, \ldots, t_{i-1}, \diamond, t_{i+1}, \ldots, t_n)]\})$$

Theorem 1. The algorithm Altex identifies the class of regular tree languages
in polynomial time.

Proof. The algorithm Altex starts with the construction of
$T_{\mathcal{L}}(\mathcal{E}, \mathcal{E}[\diamond])$ and enters the loop. If the
program leaves this loop, the observation table is consistent and by Lemma 4,
the automaton given as output is correct. It remains to show that Altex
terminates. From Lemma 5, each time the loop is processed, a new separating
context is added to the table. Since two rows that were different are still
different in the new table, the number of states of the automaton $A_T$ strictly
increases. By Remark 3, this number is always lower than or equal to the number
of states of the minimal automaton for $\mathcal{L}$. This implies that the loop
may be processed only a finite number of times and, as a consequence, Altex
terminates.

We notice that any finite set containing a representative sample is also a
representative sample. This implies that if we consider an incremental version of
Altex (the table is completed as the set of positive examples increases during the
process), the algorithm converges. Now, if the input set does not yet contain a
representative sample, the algorithm calculates a sub-automaton (the recognized
language is strictly contained in the target language). The algorithm terminates
on any input, and the success of learning is guaranteed if a representative sample
is presented as input, which constitutes a weak hypothesis.

The time complexity of Altex depends on the size $n$ of a representative
sample $\mathcal{E}$ (the size is the sum of the sizes of the terms in
$\mathcal{E}$) and the size $m$ of the minimal automaton (number of states) for
the language $\mathcal{L}$ to identify. The first observation table has at most
$n^2$ cells, since it has at most $n$ rows and at most $n$ columns. Then, the
number of rows does not change and the number of columns increases until the
table is consistent. At each step, at most $n$ new contexts are added to the
table. And the number of iterations of the loop is bounded by $m$. Now, since we
have $m \le n$, we conclude that the runtime is bounded by a polynomial in $n$.

4.3 Why Does It Not Blow Up? 

In [1], Angluin studies the paradigm of learning regular (word) languages from
positive examples and queries, and in [3], the idea of an observation table is
introduced. It is interesting to see whether these results may be applied in the
case of tree languages. Angluin's algorithms try any possible transition by
considering the set of words $wa$, where $w$ is a prefix and $a$ is an element of
the alphabet. To apply this technique in the case of trees, we have to construct,
for any subterm $t$ of $\mathcal{E}$, all terms of the form
$f(t_1, \ldots, t, \ldots, t_n)$, for each subterm $t_j$ of $\mathcal{E}$ and
each element $f$ of arity $n$ in the alphabet. This straightforward
generalization leads to a procedure that is exponential in the maximum arity of
the alphabet.

Altex proceeds in a different way. It efficiently determines, from the
non-consistent table, a small set of contexts among which one is separating.

5 Examples

In Example 2, we saw that the tree $a(b(b), c(c(c)))$ is a representative sample
for the tree language

$$\mathcal{L}_A = \{a(b^{2n+2}, c^{2m+1}) : n, m \in \mathbb{N}\}$$

Now suppose that the above singleton is given to Altex; the table constructed
with the help of membership queries is the table of Example 3. This table is
consistent and so Altex outputs the automaton of Example 4. This automaton is a
renaming of $A$ and the language is learned. If we suppose that, with the same
input, the language to learn is

$$\mathcal{L}_{A'} = \{a(b^n, c^m) : n, m \in \mathbb{N}^*\}$$

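($\{a(b(b), c(c(c)))\}$ is representative for this language too), the table is now:

$$
\begin{array}{l|cccccc}
 & \diamond & a(\diamond, c(c(c))) & a(b(\diamond), c(c(c))) & a(b(b), \diamond) & a(b(b), c(\diamond)) & a(b(b), c(c(\diamond))) \\
\hline
a(b(b), c(c(c))) & 1 & 0 & 0 & 0 & 0 & 0 \\
b(b) & 0 & 1 & 1 & 0 & 0 & 0 \\
b & 0 & 1 & 1 & 0 & 0 & 0 \\
c(c(c)) & 0 & 0 & 0 & 1 & 1 & 1 \\
c(c) & 0 & 0 & 0 & 1 & 1 & 1 \\
c & 0 & 0 & 0 & 1 & 1 & 1
\end{array}
$$
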

Altex checks again that this table is directly consistent and outputs the
automaton $\phi'(A')$, where $\phi'$ is the automaton homomorphism defined by:

$$\phi'(q_F) = 100000,\quad \phi'(q_1) = 011000,\quad \phi'(q_2) = 000111.$$

Altex is defined as an iterative algorithm; if during the process the
observation table is found not to be consistent, the minimal automaton for the
target language must have some particular rules: rules which are identical except
in a single state of their left-hand side. If the minimal automaton has no such
rules, the first table constructed by Altex is consistent and the language is
learned immediately. With the aim of illustrating the iterative behavior, we now
propose an automaton specially constructed to have this property. Let
$\mathcal{L}_{A''}$ be the language defined by the following minimal automaton
$A''$:

$$
\begin{array}{llll}
a(q_1) \to_{A''} q_F & d(q_3, q_5) \to_{A''} q_1 & e(q_7) \to_{A''} q_3 & g \to_{A''} q_7 \\
a(q_2) \to_{A''} q_F & d(q_4, q_5) \to_{A''} q_1 & e(q_8) \to_{A''} q_4 & h \to_{A''} q_8 \\
b(q_1) \to_{A''} q_F & d(q_3, q_6) \to_{A''} q_1 & f(q_9) \to_{A''} q_5 & i \to_{A''} q_9 \\
c \to_{A''} q_1 & d(q_4, q_6) \to_{A''} q_2 & f(q_{10}) \to_{A''} q_6 & j \to_{A''} q_{10}
\end{array}
$$

where $q_F$ is the unique final state. From this definition, we establish that
$\mathcal{L}_{A''}$ is the finite language corresponding to the following set of
terms:

$$
\begin{array}{ll}
\mathcal{L}_{A''} = \{ & a(c),\ b(c),\ a(d(e(g), f(i))),\ a(d(e(h), f(i))),\ a(d(e(g), f(j))), \\
 & a(d(e(h), f(j))),\ b(d(e(g), f(i))),\ b(d(e(h), f(i))),\ b(d(e(g), f(j)))\}.
\end{array}
$$

Let us suppose that a teacher constructs the representative sample:

$$\mathcal{E} = \{b(c),\ a(d(e(g), f(i))),\ a(d(e(h), f(i))),\ a(d(e(g), f(j))),\ a(d(e(h), f(j)))\}.$$

The learner Altex starts with the construction of
$\mathcal{F} = \mathcal{E}[\diamond]$ and
$T = T_{\mathcal{L}}(\mathcal{E}, \mathcal{F})$.

Altex now notices the three problematic pairs of terms

$d(e(h), f(i))$ and $d(e(h), f(j))$,

$d(e(g), f(i))$ and $d(e(h), f(j))$

and

$d(e(g), f(j))$ and $d(e(h), f(j))$.

Indeed

$$row(e(g)) = row(e(h)) \text{ and } row(f(i)) = row(f(j))$$

but

$$b(d(e(g), f(i))) \in \mathcal{L},\quad b(d(e(g), f(j))) \in \mathcal{L},\quad b(d(e(h), f(i))) \in \mathcal{L}$$

and

$$b(d(e(h), f(j))) \notin \mathcal{L}.$$

Altex adds $b(d(e(g), \diamond))$, $b(d(e(h), \diamond))$, $b(d(\diamond, f(i)))$
and $b(d(\diamond, f(j)))$ to $\mathcal{F}$ and completes the table $T$ with the
help of the teacher.

Altex now remarks that

$$b(d(e(h), f(i))) \in \mathcal{L},\quad b(d(e(h), f(j))) \notin \mathcal{L} \text{ but } row(i) = row(j)$$

and that

$$b(d(e(g), f(j))) \in \mathcal{L},\quad b(d(e(h), f(j))) \notin \mathcal{L} \text{ but } row(g) = row(h).$$

The new contexts $b(d(e(\diamond), f(j)))$ and $b(d(e(h), f(\diamond)))$ are added
to $\mathcal{F}$ and the table $T$ is completed one last time with the help of the
teacher.

$T$ is now consistent and Altex outputs $A_T$, which verifies
$\mathcal{L}_{A_T} = \mathcal{L}_{A''}$.



6 Conclusion 

We have shown that the whole class of regular tree languages is learnable from
positive examples and membership queries, which extends Angluin's work. We
gave a polynomial algorithm for this learning model and proved its correctness.
It is worth noticing that, with the same proofs, Altex is able to learn from
partial examples, that is, subtrees of positive examples. Indeed, in the definition
of representative samples, we stipulate that a representative sample must be a
subset of the language; there is no need for this restriction, and it is enough
that every transition of the minimal automaton is used. This constitutes
a new paradigm that can be even more realistic for the problem of modeling
natural language acquisition. A part of a sentence can be used by a child, even
if the sentence in which it appears is not totally understood. Moreover, this may
partially answer the problem of noise in signal processing.

Abstract. One of the most important issues in educational systems is to define
effective teaching policies according to the students' learning characteristics.
This paper proposes to use the Reinforcement Learning (RL) model so that
the system automatically learns the sequence of contents to be shown to the
student, based only on interactions with other students, as human tutors do. An
initial clustering of the students according to their learning characteristics is
proposed so that the system adapts better to each student. Experiments show
convergence to optimal teaching tactics for different clusters of simulated
students, and that the convergence is faster when the system tactics have
been previously initialised.



1 Introduction 

Web-based education (WBE) is currently a hot research and development area.
Traditional web-based courses usually are static hypertext pages without student
adaptability, providing the same page content and the same set of links to all
users. However, since the late nineties, several research teams have been
implementing different kinds of adaptive and intelligent systems for WBE.

The Web-based Adaptive and Intelligent Educational Systems (Web-based AIES)
use artificial intelligence techniques in order to adapt better to each student.
One of the main problems of AIES is to determine which content is best to show
next and how to show it.

RLATES (Reinforcement Learning Adaptive and intelligent Educational System)
is a Web-based AIES. This proposal provides intelligence and student adaptability
in order to teach Database Design topics. RLATES is able to adapt the hypermedia
page contents and links shown to the students (adaptive presentation and adaptive
navigation support) based on the Reinforcement Learning (RL) model [5] in order to
provide the student with an "optimal" curriculum sequence according to his/her
learning characteristics at each moment of the interaction. This approach forms
part of the PANDORA project [3], whose main goal is to define methods and
techniques for database development implemented in a CASE tool.

System learning begins with a clustering of the students according to their
characteristics. A different pedagogical tactic (action policy from the
Reinforcement Learning point of view) is learned for each of those clusters. The
goal of this paper is to study how many students need to interact with the system
before an optimal pedagogical sequence of contents is learned. Experiments will
show that it depends on the homogeneity of the student clusters: if all the
students belonging to the same cluster have similar learning characteristics, the
system adapts its curriculum sequence faster and better for each student.
Furthermore, we will show that initializing the action policy reduces the number
of students needed for convergence to a very small number, even when the
initialization is done assuming students with different learning characteristics.

The paper is organized as follows: first, the proposal architecture is described in
Section 2. Section 3 shows how to apply Reinforcement Learning to educational
systems. Next, the functional phases are defined in Section 4. Experiments and main
results are presented in Section 5. Finally, the main conclusions and further research
of this work are given in Section 6.



2 Proposal Architecture 

RLATES is composed of four well differentiated modules shown in Figure 1. The
student module contains all important information about the student in the learning
process: goals, student background knowledge, personal characteristics, historical
behavior, etc. Experiments in Section 5 show the importance of constructing a good
student model and clustering the learners according to their critical learning
characteristics. A great variety of student models and techniques have been
studied [8], and any clustering technique could be used to sort students according
to their learning characteristics.

The domain module contains all characteristics of the knowledge to teach. For the
experimentation analysed in this paper the Database Design domain has been used.
The hierarchical structure of topics is used for the domain knowledge, where each
topic is divided into other topics and tasks (sets of definitions, examples,
problems, exercises, etc.) in several formats (image, text, video, etc.).

In the pedagogical module, the educational system finds the best way (the se- 
quence of contents) to teach the knowledge items, corresponding with the internal 
nodes of the tree (topics), to the current student. The definition of this problem as a 
Reinforcement Learning problem is explained in Section 3. 

Finally, the interface module facilitates the communication between the AIES and
the student. This module applies intelligent and adaptive techniques in order to adapt
the content and the navigation to the students, relying on the pedagogical module,
which decides which task is to be shown next to the student and in which format the
knowledge is going to be taught.

3 Application of Reinforcement Learning

Reinforcement learning problems [7] treat agents connected to their environment via
perception and action. On each step of the interaction, the agent observes the current
state, s, and chooses an action to be executed, a. This execution produces a state
transition and the environment provides a reinforcement signal, r, that indicates how
good the action has been for solving a defined task. The final goal of the agent is to
behave by choosing the actions that tend to increase the long-run sum of values of the
reinforcement signal, r, learning its behavior by systematic trial and error, guided by a
wide variety of algorithms [7].




In RLATES, a state is defined as the student knowledge. It is represented by a
vector of values related to domain knowledge items (internal nodes of the domain
knowledge tree). The i-th value of the vector represents the knowledge level of the
student about the i-th topic. In this work we present a simple example and, in order to
limit the size of the state space, possible values of the student knowledge are limited
to 0 (the student does not know the item) and 1 (the student knows the item).
Sometimes, educational systems need to know how well the student learns a topic. This
information can be easily added to our system by extending the possible values of the
knowledge vector. Furthermore, RLATES perceives the current student state (the
values of the knowledge vector) by evaluations (tests). The system actions correspond
to the hypermedia pages that the educational system shows to the student (the tasks
of the knowledge tree: definition, exercise, problem, etc.). The reinforcement signal
(r) supplies a maximum value upon arriving at the goals of the tutor, i.e., when the
student learns the totality of the contents of the educational system. This signal is
used to update the system's action policy. The system behavior, B, should choose the
actions that tend to maximize the long-run sum of values of the reinforcement signal,
choosing in this way the optimal tutoring strategy (what, when, and how to teach; the
best sequence of contents and how to teach them) to coach the current learner. The
action-value function, Q(s,a), estimates the usefulness of executing one action
(showing leaves of the knowledge tree to a student) when the system is in a given
knowledge state, s. This function provides an action policy, defined as shown in
equation (1).

$$\pi(s) = \arg\max_{a_i} Q(s, a_i) \qquad (1)$$

The goal of the learning process is to find the policy that maximizes this
function, i.e., to obtain the optimal action-value function, denoted by $Q^*(s,a)$,
such that $Q^*(s,a) \ge Q(s,a)\ \forall a \in A, s \in S$. There are several ways to
learn this function, from dynamic programming [2] to model-free methods [9]. The
algorithm implemented in RLATES is the Q-learning algorithm [10], whose action-value
function is defined in equation (2).



$$Q(s,a) = (1 - \alpha)\, Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) \qquad (2)$$

This equation requires the definition of the possible states, s, the actions that the
agent can perform in the environment, a, and the rewards that it receives at any
moment for the states it arrives at after applying each action, r. The $\gamma$
parameter controls the relative importance of future rewards with respect to
immediate ones, and the $\alpha$ parameter is the learning rate, which indicates how
quickly the system learns.

Table 1 shows how the Q-learning algorithm has been adapted to educational
systems.
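The update rule (2) is a one-line computation once the Q table is stored as a dictionary. The sketch below is illustrative only: the function name `q_update`, the state and action labels, and the default values of `alpha` and `gamma` are assumptions, not settings reported in the paper.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply update (2) to the entry Q[(s, a)] and return its new value."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


# Hypothetical knowledge states and tasks, for illustration only.
actions = ["definition_video", "definition_text"]
Q = defaultdict(float)  # every entry starts at 0.0

first = q_update(Q, "s0", "definition_video", 1.0, "s1", actions, alpha=0.5)
Q[("s1", "definition_text")] = 2.0
second = q_update(Q, "s0", "definition_video", 0.0, "s1", actions, alpha=0.5)
print(first, second)
```

With `alpha = 0.5` the first call moves the entry halfway from 0 to the reward 1.0; the second call mixes the old value 0.5 with the discounted best value of the next state, 0.9 * 2.0.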

The Boltzmann exploration policy has been used in RLATES, because it has been
demonstrated previously that it improves the system convergence in relation to the
$\epsilon$-greedy exploration policy [5]. This exploration policy estimates the
probability of choosing the action a according to the function defined in equation
(3), where $\tau$ is a positive parameter called the temperature. If the temperature
is high, all the probabilities of the actions have similar values, and if the
temperature is low, there is a great difference in the probability of selecting
each action.

$$P(a) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}} \qquad (3)$$
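The effect of the temperature can be checked numerically with a small sketch of Boltzmann action selection. The function names and the Q values below are hypothetical; subtracting the maximum Q value before exponentiating is a standard numerical precaution that leaves the probabilities unchanged.

```python
import math
import random


def boltzmann_probabilities(q_values, tau):
    """Return the selection probability of each action for temperature tau > 0."""
    top = max(q_values)
    weights = [math.exp((q - top) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_choice(actions, q_values, tau, rng=random):
    """Draw one action according to the Boltzmann probabilities."""
    probabilities = boltzmann_probabilities(q_values, tau)
    return rng.choices(actions, weights=probabilities, k=1)[0]


q_values = [1.0, 2.0]  # hypothetical Q(s, a) values for two actions
for tau in (100.0, 1.0, 0.01):
    print(tau, boltzmann_probabilities(q_values, tau))
```

At a temperature of 100 the two actions are almost equally likely, while at 0.01 the action with the larger Q value is selected nearly always.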



4 System Functional Phases

The use of RLATES requires three phases in order to adapt better to each student at
every moment of the interaction: student clustering, system training, and system use.

4.1 Student Clustering

RLATES is able to adapt to each student individually, updating its Q(s,a) function
according to the interaction with the student.

If the system maintains only one Q table for all the students that could interact,
RLATES adapts to the set of all the students. However, they could have very different
learning characteristics and the adaptation could be poor.

Table 1. Adaptation of the Q-learning algorithm to Educational Systems

Q-learning adapted to AIES domain

- For each pair (s ∈ S, a ∈ A), initialize the table entry Q(s,a).

- Do for each student of the same cluster

    o Test the current student knowledge, obtaining s
    o While the student has not finished learning (s is not a goal state)

        • Select a knowledge tree leaf, a, to show to the student, following an
          exploration strategy.

        • Test the current student knowledge, s'

        • Receive the immediate reward, r. A positive reward is received when the AIES
          goal is achieved. A null reward is obtained in any other case.

        • Update the table entry for Q(s,a), which estimates the usefulness of
          executing action a when the student is in a particular knowledge state:

          Q(s,a) = (1-α) Q(s,a) + α (r + γ max_{a'} Q(s',a'))

        • Let s be the current student knowledge state, s'.
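The loop of Table 1 can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the state is reduced to the number of topics already learned, the student model is a hypothetical coin flip per action, and the discount value γ = 0.9 is an assumption (only the learning rate of 0.9 appears later in the paper).

```python
import math
import random

def boltzmann_choice(q_row, temperature, rng):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    q_max = max(q_row)
    weights = [math.exp((q - q_max) / temperature) for q in q_row]
    return rng.choices(range(len(q_row)), weights=weights)[0]

def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply Q(s,a) = (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * max(q[s_next]))

def train(num_students, num_topics, p_learn, alpha, gamma, temperature, seed=0):
    """Toy stand-in for Table 1: the state is the number of topics already learned."""
    rng = random.Random(seed)
    goal = num_topics
    # One row per state (0..goal), one column per action; all entries start at zero.
    q = [[0.0] * len(p_learn) for _ in range(goal + 1)]
    for _ in range(num_students):
        s = 0  # result of the initial knowledge test
        while s != goal:
            a = boltzmann_choice(q[s], temperature, rng)
            # Hypothetical student model: action a succeeds with probability p_learn[a].
            s_next = s + 1 if rng.random() < p_learn[a] else s
            r = 1.0 if s_next == goal else 0.0  # reward only when the goal is reached
            q_update(q, s, a, r, s_next, alpha, gamma)
            s = s_next
    return q

# Action 0 = text page, action 1 = video page (success rates as for cluster type 1).
q_table = train(num_students=50, num_topics=3, p_learn=[0.05, 0.95],
                alpha=0.9, gamma=0.9, temperature=0.1)
```

Keeping one such table per student cluster corresponds to calling `train` separately for each cluster.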



The solution is to cluster the students according to their learning characteristics be- 
fore the interaction with RLATES. In this sense, the system maintains one Q table for 
each cluster of students. This allows the system to adapt better to each student cluster. 

This phase is not necessary, but it is recommended for better adaptation to each 
user interacting with the system. 

In this paper, a comparison of the system convergence when students of different
clusters interact is presented. Two different kinds of clusters have been defined based
on the homogeneity of the students in the cluster (how similar their learning
characteristics are). Only two learning characteristics have been used in order to
define the population of students: the kind of task (introduction or definition) and
the format of the hypermedia page (video or text) that they require to learn, as
defined in Table 2.

In cluster1, all the students learn only with definition tasks. They will learn the
task with a probability of 0.95 if the task has video format, while if the task has
text format, it will be learnt only with a probability of 0.05.

The cluster2 students, on the other hand, require definitions and introductions,
each with a probability of 0.5. They will learn tasks in the video format with a
probability of 0.75 and tasks in the text format with a probability of 0.25.

It is hypothesized that RLATES would adapt better to the cluster1 students
individually, because they have very similar learning characteristics, whereas the
cluster2 students are more heterogeneous.



Table 2. Students Cluster Types

Student Clusters                    Tasks                  Formats
----------------------------------  ---------------------  ----------------
Cluster Type 1                      Definitions (100%)     Video (95%) &
(more homogeneous cluster)                                 Text (5%)

Cluster Type 2                      Definitions (50%) &    Video (75%) &
(the most heterogeneous cluster)    Introductions (50%)    Text (25%)



4.2 System Training 

In this phase, the proposal explores new pedagogical alternatives in order to teach the 
system knowledge, sequencing the domain content in different ways. At the same 
time the system interacts with the students and updates the appropriate Q table. 

In this way, the system is able to converge to good teaching strategies based only
on previous interactions with other students. This is related to the
exploration/exploitation strategies in Reinforcement Learning. In [6], the advantages
of the Boltzmann exploration strategy are demonstrated. That is why the RLATES system
uses this exploration policy.

In this phase, the students that interact with the system may not be learning in the
best way, because the system has not yet learned the optimal pedagogical policy. It is
therefore desirable to keep this phase as short as possible. This phase finishes once
a near-optimal action policy has been learned.



4.3 System Use 

When the system has converged to a good pedagogical strategy, it is time to use this
information to teach other students with similar learning characteristics. From the
point of view of the Q-learning algorithm, that means setting the learning rate (α)
parameter to zero and selecting a greedy strategy for action selection. These students
will achieve their knowledge goals in the best way the system has learned.

Although the system interaction with students has been divided into two phases
(System Training and System Use), the system never stops learning, adapting its
action-value function, Q(s,a), in each step of the interaction with each student. The
system is said to be in the System Use phase when it has learned a near-optimal
teaching sequence of contents.

5 Experimentation

The motivation of the experimentation in this work is to analyze how quickly the
system is able to adapt its behavior to the students' needs (the length of the System
Training phase, measured in the number of students needed to converge).

All the experimentation has been performed with simulated students in order to
theorize about the system behavior when interacting with real students. How the
behavior of the students has been simulated is explained in section 5.1. We have
empirically tested a variety of cluster types, from very homogeneous to heterogeneous.
Therefore, we believe our conclusions are quite general.



5.1 Simulated Students 

In order to obtain general conclusions, a large number of experiments is necessary,
and therefore many students are needed to interact with the system.

Most researchers in educational systems use simulated students in order to prove
the applicability of their systems, as was done in Beck’s seminal work [1], for
two important reasons:

First, it is difficult to persuade one person to use an application that may not be
optimized, especially when the application is an educational system that requires
high concentration (the application tests the knowledge of the person). To persuade a
hundred people is unthinkable.

Second, experimentation with real students has a high cost, because they spend
a lot of time in the interaction and sometimes they abandon it when they get bored or
they notice that the system does not teach the content in the best way.

In this paper we want to test the system under very different conditions in order to
draw general conclusions. We have used simulated students, taking into account only
two parameters in order to define their behavior: the student's hypermedia format
preference (text, video, image, etc.) and the student's preferred type of content
(definition, introduction, etc.), as defined in section 4.1.

Moreover, expert knowledge (knowledge of a database design teacher at the Carlos
III University of Madrid) has been used in order to define the prerequisite
relationships between the topics and the elements (tasks) of the topics. In this
sense, when a hypermedia page of a certain topic, A, is shown to a student who does
not know the prerequisite topics of A, the student is not able to pass the exam
associated with this topic (he could not learn this topic). Every simulated student
uses this prerequisite table so that simulation results are reasonable and similar to
what could be obtained from real students.
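A simulated student of this kind might be sketched as follows. The class name, the three-topic prerequisite table, and the decision to model only the page format (ignoring the task type) are assumptions made for illustration; only the format probabilities come from Table 2.

```python
import random

# Learning probabilities by page format, taken from Table 2.
CLUSTER_FORMAT_PROB = {
    "cluster1": {"video": 0.95, "text": 0.05},
    "cluster2": {"video": 0.75, "text": 0.25},
}

# Hypothetical prerequisite table: topic -> topics that must be known first.
PREREQUISITES = {"A": set(), "B": {"A"}, "C": {"A", "B"}}

class SimulatedStudent:
    def __init__(self, cluster, rng):
        self.format_prob = CLUSTER_FORMAT_PROB[cluster]
        self.known = set()
        self.rng = rng

    def show_page(self, topic, page_format):
        """Return True if the student passes the exam for `topic` after this page."""
        if not PREREQUISITES[topic] <= self.known:
            return False  # prerequisites missing: the topic cannot be learned
        if self.rng.random() < self.format_prob[page_format]:
            self.known.add(topic)
            return True
        return False

student = SimulatedStudent("cluster1", random.Random(0))
blocked = student.show_page("B", "video")  # False: topic A is not known yet
```

Passing the random number generator in explicitly keeps repeated experiments reproducible.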
5.2 Experimentation Method

Although simulated students are used in these experiments, real situations are going 
to be studied: when the system interacts with different students and learns in each 
interaction according to the Q-learning algorithm. 

The experiments study three important issues. First, the number of students needed
in order for the system to learn the action policy for a kind of cluster without
initial pedagogical knowledge, i.e. initializing all the entries of the Q table to
zero. Second, whether the number of students required can be reduced by initializing
the action policy with pedagogical knowledge. This could be done by training the
system with simulated students which are supposed to model real students. Third, what
happens if the model does not represent the real student characteristics. This is
equivalent to a situation where the system has been trained with students assumed to
belong to Cluster Type 1 and the real students belong to Cluster Type 2.

For the experiments, the learning rate parameter (α) of the Q-learning function
has been fixed to 0.9 and the Boltzmann exploration/exploitation policy is followed.
Notice that initially the system has no knowledge at all about teaching strategies.

Due to the stochasticity of the domain, each experiment has been carried out ten
times, and the average obtained when the system learns to teach with simulated
students is shown.



5.3 Results 

In Figure 2.A, the system convergence is shown when 200 cluster1 students interact
with RLATES without initializing the Q table, for different temperature values. It is
shown that the system only needs around 50 students in order to learn the optimal
policy, obtaining the best result when the temperature parameter value is 0.01. With
this value, an average of only 20 actions is required to teach the 11 topics of the
domain module when cluster1 students interact with RLATES.




Fig. 2. A: cluster1 to cluster2 system convergence; B: cluster2 to cluster1 system convergence.



Figure 2.B shows the system convergence when 200 cluster2 students interact with
the system. Convergence is only achieved when the temperature parameter is 0.01,
requiring around 60 students. RLATES needs an average of 30 actions to teach only 11
topics, given the high heterogeneity of the students' learning characteristics in this
cluster type.

Although these results are reasonable, we are interested in reducing the number of
students needed in the System Training phase by initializing the Q table with the
pedagogical knowledge learned by interacting with other students. If the
initialization is correct according to the learning characteristics of the real
students, the Training phase will be eliminated. Obviously, this assumption is very
strong and, usually, an adaptation of the pedagogical policy will be required in order
to achieve the best curriculum sequence for current users. This is illustrated in
Figure 2.A, where cluster2 students begin to interact with RLATES when it has
previously learned the optimal teaching policy for cluster1 students, and vice versa
in Figure 2.B. In these figures, it can be observed that only 20 students are needed
for the Training phase, even when this initialization was not good for the current
students’ learning characteristics. In Figure 2.B, we can observe that the system does
not converge when the temperature is 0.01, because it behaves almost greedily.

Figure 3 summarizes all these results when the temperature is 0.1, showing the
Training phase length for each cluster of students when no initialization is performed
and when a bad initialization is done. Notice that if a correct initialization is
performed, the Training phase length is zero. This is the best situation for RLATES.



[Figure 3: Training phase length for cluster1 and cluster2, each with and without
initialization; horizontal axis: number of actions, 0 to 200.]




Fig. 3. Experiments summary 



6 Conclusions and Further Research 

This paper presents the application of the reinforcement learning model in an intelli- 
gent and adaptive educational environment. Moreover, experiments that prove the 
convergence of RLATES, the importance of a previous student clustering and the 
initialization of the pedagogical strategies have been presented and analyzed. 

Simulated students have been used because it is necessary to tune all the learning 
parameters before using the system with real students. Also, we intended to draw 
general conclusions for RLATES acting under different conditions. 

The experiments show three important points. First, the system automatically
adapts its teaching tactics according to the current student learning characteristics,
whatever the initial situation was. Second, RLATES adapts to the set of all the
students, adapting better when all of them have similar learning characteristics.
This motivates a previous student clustering, although it is not necessary for
the system. Third, the system reduces its Training phase (interacting with fewer
students) when an initialization of the pedagogical strategies has been done,
even for bad initializations (with simulated students whose learning characteristics
differ from those of the real ones). This can be achieved by using simulated students
before the system is put to real use with actual students. Even bad initializations
will be better than no initialization at all.

Currently, we are evaluating the proposal with real students, initializing the
pedagogical knowledge with simulated students. It is believed that this
initialization will reduce the interaction time with real students until the system
converges. For this validation we are designing an experiment with students of our
Faculty belonging to different courses (intermediate and advanced courses of
database design) in the Computer Science degree.
Abstract. Digital wireless communication systems can be regarded as
solving a statistical learning problem in real time. The sender-side process
of encoding and/or modulating information to be sent can be viewed
as the generation process of training data from the statistical learning point
of view, while the receiver-side process of decoding and/or demodulating
the information on the basis of possibly noisy received signals can be viewed
as the learning process based on the training data set. Based on this
view one can analyze digital wireless communication systems within the
framework of statistical learning, where an approach based on statistical
physics provides powerful tools. Analysis of the code-division multiple-access
(CDMA) user detection problem is discussed in detail as a demonstrative
example of this approach.



1 Introduction 

Various information processing tasks naturally fit into the framework of
statistical inference. Nevertheless, this view was not considered seriously on many
occasions in the past, because formulating a problem in terms of statistical
inference often leads to a prohibitively large amount of computation required to
solve the problem. However, recent advances in information technology allow
us to reconsider the framework of statistical inference and apply it to problems of
information processing.

Digital wireless communications consist of various information processing 
tasks, many of which are best treated within the probabilistic framework. This 
is because digital wireless communications have to deal with various types of 
noise which are hard to characterize, so that they should be treated as stochastic 
processes. Rapidly expanding markets of various applications of digital wireless 
communications, including mobile phones and wireless LANs, are demanding 
analysis and design of larger-scale digital wireless communication systems. 

The first objective of this paper is to demonstrate that communications
based on the code-division multiple-access (CDMA) technology, which is
increasingly adopted in several third-generation commercial mobile-phone
services, can be viewed as an instance of a statistical learning problem. One can
expect that this view is effective in analyzing systems’ performance theoretically,
as well as in deducing design policies to be used in formulating an algorithm to
solve a problem in CDMA communication systems, since it allows us to make
use of knowledge accumulated so far in the field of statistical learning. The most
nontrivial contribution to the field of statistical learning would be that from
statistical mechanics [1], which suggests that various statistical-mechanical tools
and concepts might be effectively applied as well to problems of analysis and
design in the field of CDMA communications. The second objective of this paper
is to show that this is indeed the case, by reviewing an analysis of CDMA systems
using the replica method, a tool for analysis originally developed in statistical
mechanics.

This paper is organized as follows. In Sect. 2 we briefly review the basic
framework of statistical learning theory. In Sect. 3 we introduce the CDMA channel
model and the user detection problem, which are dealt with in the rest of the
paper. The link between the user detection problem and statistical learning is also
established in this section. Section 4 briefly explains that communications using
error-control coding can also be linked to the framework of statistical learning.
Returning to the main subject of CDMA, Sect. 5 first introduces the Bayesian
framework for the user detection problem, and then describes the performance
analysis of the optimum user detection scheme for the CDMA channel model. The
Appendix gives a more detailed description of how the replica method is applied to
the analysis of the optimum user detection scheme.

2 Statistical Learning Theory 

We briefly review the basic framework of statistical learning theory. The basic 
setup of statistical learning consists of a “teacher” system, a “student” system, 
and a training data set. The student system is supposed to “learn” the teacher’s 
behavior, by only observing the training data set, which is generated by the 
teacher. 

Assume that the teacher has a stochastic input-output relation, which is
characterized by a conditional distribution $p_0(y|x)$, where $x$ and $y$ denote input
and output of the system, respectively. One may wish to deal with a deterministic
teacher whose input-output relation is not stochastic but deterministic, but we
can “stochastify” such a deterministic system by adding small noise to it, so
that the deterministic-teacher case can be treated within the same framework
by regarding it as a zero-noise limit of such stochastified teachers.

A training data set $T = \{(x_\mu, y_\mu);\ \mu = 1, \ldots, N\}$ consists of $N$ pairs
of input $x_\mu$ and output $y_\mu$. The output $y_\mu$ is assumed to be generated by the
teacher system according to the conditional distribution $p_0(y|x_\mu)$. The inputs
$x_1, \ldots, x_N$ are assumed to be realizations of independent and identically
distributed (i.i.d.) random variables following a distribution $p_0(x)$. We also let
$T_x = \{x_\mu;\ \mu = 1, \ldots, N\}$ and $T_y = \{y_\mu;\ \mu = 1, \ldots, N\}$.

The student system also has a stochastic input-output relation, which is
characterized by another conditional distribution $p(y|x)$. The conditional distribution
$p(y|x)$ is assumed to be of a parametric form with a parameter, denoted by $b$ in
this paper, so that the student can vary its behavior by adjusting it. The task of
“learning from examples,” which the student is supposed to perform, is to find
a value of the parameter $b$ which best approximates the teacher’s behavior in
view of the training data set $T$.

The Bayesian framework is best suited to treat the problem of learning from
examples. Let us assume that the student has a prior distribution $p(b)$ for the
parameter $b$. Then, applying Bayes’ theorem, one obtains the posterior distribution
of $b$ conditioned on the training data set $T$, as

\[ p(b|T) = \frac{p(T_y|T_x, b)\,p(b)}{\sum_b p(T_y|T_x, b)\,p(b)}. \qquad (1) \]



The conditional distribution $p(T_y|T_x, b)$ is given by

\[ p(T_y|T_x, b) = \prod_{\mu=1}^{N} p(y_\mu|x_\mu, b), \qquad (2) \]



where the notation $p(y|x, b)$ is used in order to explicitly show that the
conditional $p(y|x)$, characterizing the student’s input-output relation, depends on the
parameter $b$. Based on the posterior $p(b|T)$, one can construct various estimators
for $b$. They include:

- Maximum a posteriori (MAP) estimator:

  \[ \hat{b}^{(\mathrm{MAP})} = \arg\max_{b} p(b|T) \]

- Marginal posterior mode (MPM) estimator, if $b$ is a vector $(b_1, \ldots, b_K)$:

  \[ \hat{b}_k^{(\mathrm{MPM})} = \arg\max_{b_k} \sum_{b \setminus b_k} p(b|T), \qquad k = 1, \ldots, K \]

- Posterior mean estimator:

  \[ \hat{b}^{(\mathrm{PM})} = \sum_{b} b\, p(b|T) \]
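The difference between the three estimators can be seen on a small, made-up example. The posterior table below is hypothetical (it is not derived from any channel model) and is chosen so that the MAP and MPM estimates disagree; the function names are likewise illustrative.

```python
def map_estimate(posterior):
    """Return the configuration b that maximizes p(b|T)."""
    return max(posterior, key=posterior.get)

def mpm_estimate(posterior, symbols=(-1, 1)):
    """For each component k, maximize the marginal posterior of b_k."""
    num_components = len(next(iter(posterior)))
    estimate = []
    for k in range(num_components):
        # Marginalize over all components except b_k.
        marginal = {v: sum(p for b, p in posterior.items() if b[k] == v)
                    for v in symbols}
        estimate.append(max(marginal, key=marginal.get))
    return tuple(estimate)

def posterior_mean(posterior):
    """Return sum_b b p(b|T), component by component."""
    num_components = len(next(iter(posterior)))
    return tuple(sum(b[k] * p for b, p in posterior.items())
                 for k in range(num_components))

# Hypothetical posterior over b in {-1, +1}^2.
posterior = {(1, 1): 0.4, (-1, -1): 0.3, (-1, 1): 0.3, (1, -1): 0.0}

b_map = map_estimate(posterior)     # most probable joint configuration
b_mpm = mpm_estimate(posterior)     # most probable value of each component
b_pm = posterior_mean(posterior)    # real-valued average
```

Here the single most probable configuration is (1, 1), whereas the first component is more likely to be -1 once the second component is summed out, so the MPM estimate is (-1, 1).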

In order to quantify how good an estimator is, one needs a model for the
teacher system. Let us assume that the teacher’s conditional distribution is also
parametrized by $b$ as $p_0(y|x, b)$, and that the true prior distribution of $b$ is
$p_0(b)$. It should be noted that we make a distinction between the true prior $p_0(b)$
and that assumed by the student, $p(b)$, as well as between the true conditional
$p_0(y|x, b)$ and that assumed by the student, $p(y|x, b)$, in order to deal with
situations in which the student’s assumptions about the teacher are incorrect (i.e.,
$p(b) \neq p_0(b)$ and/or $p(y|x, b) \neq p_0(y|x, b)$). Let $\ell(\hat{b}, b)$ denote a
loss function defining a loss incurred by estimating $b$ as $\hat{b}$. The expected loss
of an estimator $\hat{b}$ for a given input set $T_x$ is

\[ L(\hat{b}|T_x) = \sum_{b, T_y} p_0(b)\, p_0(T_y|T_x, b)\, \ell(\hat{b}, b). \qquad (3) \]

Since we assume inputs of the training data set to be generated randomly,
one can further take its average over $T_x$, which is given by

\[ L(\hat{b}) = \sum_{T_x} p_0(T_x)\, L(\hat{b}|T_x), \qquad (4) \]

where $p_0(T_x) = \prod_{\mu=1}^{N} p_0(x_\mu)$ denotes the distribution generating $T_x$.
It should be noted that in evaluating the expected loss the expectation should be
taken with respect to the true distributions, not those assumed by the student. It is
known that the MAP estimator $\hat{b}^{(\mathrm{MAP})}$, the MPM estimator
$\hat{b}^{(\mathrm{MPM})}$, and the posterior mean estimator $\hat{b}^{(\mathrm{PM})}$ are
optimum if one takes $1 - \delta_{\hat{b}, b}$, $\sum_k (1 - \delta_{\hat{b}_k, b_k})$, and
$(\hat{b} - b)^2$ as the loss functions, respectively, and if the student’s assumptions
about the teacher are correct.

3 Code-Division Multiple-Access Communication Channel

In digital wireless communications, such as mobile phones and wireless LANs, it
is not unusual to have more than one agent which communicates with the same
receiver (a base station in cellphone systems or a wireless access point in wireless
LAN systems). The term multiple-access refers to such a situation. The
multiple-access interference (MAI) naturally occurs in a multiple-access environment,
so that it is necessary for the receiver to have some means to mitigate the MAI.

CDMA is one of the technologies to alleviate the MAI. The basic
fully-synchronous $K$-user CDMA channel model [2] (see Fig. 1) is represented by

\[ y_\mu = \frac{1}{\sqrt{N}} \sum_{k=1}^{K} s_{k\mu} A_k b_k + n_\mu, \qquad \mu = 1, \ldots, N. \qquad (5) \]

$b_k$ is an information symbol of user $k$ ($k = 1, \ldots, K$), which is assumed to
follow a prior distribution $p_{0k}(b_k)$. The information symbol undergoes the
so-called spreading modulation, in which it is multiplied by the signature sequence
$\{s_{k\mu};\ \mu = 1, \ldots, N\}$. The modulated sequence,
$\{b_k s_{k\mu};\ \mu = 1, \ldots, N\}$, is then transmitted to a receiver. The received
sequence, denoted by $\{y_\mu;\ \mu = 1, \ldots, N\}$, is a superposition of $K$ signals,
one coming from each user. $A_k$ denotes the amplitude of the signal sent by user $k$
at the receiver. We also take into account additive channel noise, which is denoted by
$\{n_\mu;\ \mu = 1, \ldots, N\}$. Instances of the channel noise are assumed to be i.i.d.
random variables following a distribution $p_0(n)$. The factor $1/\sqrt{N}$ placed at
the right-hand side of (5) is to normalize the power of the signature sequence to
unity. Equation (5) is also expressed in a vector form as

\[ y = SAb + n, \qquad (6) \]

where $y = (y_1, \ldots, y_N)^T$, $S = (1/\sqrt{N})(s_{k\mu})$, $A = \mathrm{diag}(A_k)$,
$b = (b_1, \ldots, b_K)^T$, and $n = (n_1, \ldots, n_N)^T$. The information symbol vector
$b$ follows the prior distribution $p_0(b) = \prod_{k=1}^{K} p_{0k}(b_k)$. The channel
noise vector $n$ follows the distribution $p_0(n) = \prod_{\mu=1}^{N} p_0(n_\mu)$. At
this point, without loss of generality we absorb the prefactor $A_k$ into $b_k$ and
accordingly redefine the prior $p_0(b)$. The CDMA channel model therefore becomes

\[ y = Sb + n. \qquad (7) \]




Fig. 1. CDMA channel model 



The receiver has to solve the user detection problem, that is, the problem of
estimating the information symbols $b$ based on the received sequence $y$. In
performing the user detection, the receiver is assumed to have perfect
knowledge about the signature sequences $S$. A justification for this assumption is
that real CDMA systems use pseudorandom sequences as signature sequences,
so that the receiver, knowing which users are transmitting signals, can regenerate
the same signature sequences as those used by the users sending signals, by using
the same generators as the users do.

The key observation is that the user detection problem can be cast into the
framework of learning from examples discussed in the previous section.
Equation (7) can be further rewritten as

\[ y_\mu = \frac{1}{\sqrt{N}} s_\mu^T b + n_\mu, \qquad \mu = 1, \ldots, N, \qquad (8) \]

where $s_\mu^T = (s_{1\mu}, \ldots, s_{K\mu})$. At this point, one can regard (8) as
defining a “teacher” system with input $s_\mu$, output $y_\mu$, and parameter $b$, whose
stochastic input-output relation is characterized by the conditional distribution
$p_0(y|s, b)$ as

\[ p_0(y|s, b) = p_0\!\left(y - \frac{1}{\sqrt{N}} s^T b\right), \qquad (9) \]

and regard $T = \{(s_\mu, y_\mu);\ \mu = 1, \ldots, N\}$ as a training data set, which is
generated by the teacher. The receiver’s task of user detection is to estimate $b$
based on $T$, which is equivalent to what is done by the student in the framework of
learning from examples. A receiver in CDMA systems can therefore be regarded
as solving the problem of learning from examples. It should be noted as well
that a receiver not only solves the learning problem, but also does it in almost
real time, and with acceptably high quality of solutions, both of which must
certainly be fulfilled in any commercial communication service.
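The correspondence between user detection and learning from examples can be made concrete with a small simulation. The sketch below assumes binary symbols in {-1, +1}, binary random signatures, Gaussian channel noise, and a uniform prior; the text above leaves the prior and the noise distribution general, so these choices and all function names are illustrative assumptions.

```python
import math
import random
from itertools import product

def predict(s_mu, b, num_chips):
    """Noise-free channel output (1/sqrt(N)) s_mu^T b for one chip interval."""
    return sum(s_k * b_k for s_k, b_k in zip(s_mu, b)) / math.sqrt(num_chips)

def simulate_channel(b, num_chips, noise_std, rng):
    """Draw random signatures and return the training set (s, y) for y = S b + n."""
    num_users = len(b)
    # s[mu][k] is the chip s_{k mu}, drawn uniformly from {-1, +1}.
    s = [[rng.choice((-1, 1)) for _ in range(num_users)] for _ in range(num_chips)]
    y = [predict(s_mu, b, num_chips) + rng.gauss(0.0, noise_std) for s_mu in s]
    return s, y

def posterior(s, y, noise_std):
    """Enumerate p(b|T) over all 2^K binary vectors (uniform prior, Gaussian noise)."""
    num_chips, num_users = len(y), len(s[0])
    log_weight = {}
    for b in product((-1, 1), repeat=num_users):
        squared_error = sum((y_mu - predict(s_mu, b, num_chips)) ** 2
                            for s_mu, y_mu in zip(s, y))
        log_weight[b] = -squared_error / (2.0 * noise_std ** 2)
    top = max(log_weight.values())  # shift for numerical stability
    weight = {b: math.exp(v - top) for b, v in log_weight.items()}
    total = sum(weight.values())
    return {b: w / total for b, w in weight.items()}

rng = random.Random(0)
b_true = (1, -1, 1)
s, y = simulate_channel(b_true, num_chips=8, noise_std=0.5, rng=rng)
post = posterior(s, y, noise_std=0.5)
b_map = max(post, key=post.get)  # MAP detection by exhaustive search
```

The exhaustive enumeration visits 2^K candidate vectors, so this brute-force approach is only feasible for a handful of users.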

4 Error-Control Coding 

In this section we briefly explain that communications using error-control coding
can also be cast into the framework of statistical learning. Let us consider a
linear block code and a memoryless channel. A linear block code is characterized
by its generator matrix $G$, which operates on an information block $b$ to yield
a codeword $w = G^T b \bmod 2$ of length $N$. This relation can equivalently be
rewritten as $w_\mu = g_\mu^T b \bmod 2$, $\mu = 1, \ldots, N$, where $w_\mu$ and $g_\mu$
denote the $\mu$-th member of the codeword $w$ and the $\mu$-th column of the generator
$G$, respectively. The input-output characteristics of a memoryless channel are defined
by a conditional distribution $p_0(y|w)$, where $y$ and $w$ denote an output alphabet
and an input alphabet of the channel, respectively.

One can let

\[ p_0(y|g, b) = p_0(y|w), \qquad w = g^T b \bmod 2, \qquad (10) \]

and

\[ T = \{(g_\mu, y_\mu);\ \mu = 1, \ldots, N\}. \qquad (11) \]

The task of decoding is to estimate the original information block $b$ based on the
received word $y = (y_1, \ldots, y_N)^T$ with the knowledge of the code, i.e., $G$. This
is equivalent to saying that the decoder should estimate $b$ based on a “training
data set” $T$, which, once again, can be regarded as an instance of the problems
of learning from examples.
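How a codeword and the corresponding "training data set" arise can be sketched as follows. The 2 x 3 generator matrix and the binary symmetric channel are assumptions chosen for illustration; the text only requires a linear block code and some memoryless channel.

```python
import random

def encode(generator, info_block):
    """Return w with w_mu = g_mu^T b mod 2, where g_mu is the mu-th column of G."""
    num_columns = len(generator[0])
    return [sum(row[mu] * bit for row, bit in zip(generator, info_block)) % 2
            for mu in range(num_columns)]

def transmit(codeword, flip_prob, rng):
    """Binary symmetric channel: flip each bit independently with probability flip_prob."""
    return [bit ^ 1 if rng.random() < flip_prob else bit for bit in codeword]

def training_set(generator, received):
    """Pair each generator column g_mu with the received symbol y_mu, as in (11)."""
    columns = [tuple(row[mu] for row in generator) for mu in range(len(received))]
    return list(zip(columns, received))

# Hypothetical generator with K = 2 rows and N = 3 columns (single parity check).
G = [[1, 0, 1],
     [0, 1, 1]]
b = [1, 1]
w = encode(G, b)
y = transmit(w, flip_prob=0.1, rng=random.Random(0))
T = training_set(G, y)
```

Each element of `T` is one "example" for the decoder: an input column of the generator together with the possibly corrupted output observed for it.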

Having established the relation between error-control coding and statistical
learning, it is natural to expect that various research fields which have
contributed to the field of statistical learning will contribute to studies of
error-control coding as well. In fact, a statistical-mechanical approach to analyzing
error-control coding is possible, yielding several nontrivial results. In this paper,
we do not go into further detail on this subject; see [3] for a comprehensive
review of the statistical-mechanical approach to a family of error-control codes
called low-density parity-check codes.

5 Performance Analysis 

One can pose several questions regarding the user detection problem of CDMA
systems. Since the optimum user detection problem has been proved to be
NP-hard [4], efforts have been mainly directed toward designing efficient sub-optimal
user detection algorithms which allow operation in nearly real time. However,
from a theoretical point of view, it is important to ask what the theoretical
information transmission capability of the CDMA channel model is. Answering this
question turns out to be difficult: although writing down formulas
which give performance measures is not difficult, the amount of computation
required to evaluate them typically scales exponentially as the system size
increases, so that it soon becomes prohibitively large. Moreover, those formulas
provide us with little theoretical insight into the structure of the user detection
problem. Researchers have therefore sought a way of resolving this difficulty,
and one important observation along this line is that the performance exhibits
self-averaging. That is, if one adopts the so-called random spreading assumption,
in which one regards the signatures $S$ as randomly generated and considers
averages over the randomness of the signatures, then the performance of a system
with a particular realization of $S$ does not deviate much from its average over $S$,
and the former converges in probability to the latter in the large-system limit,
in which one considers the limit $K, N \to \infty$ while their ratio $\beta = K/N$ is
kept finite. Although the approach based on the random spreading assumption as well
as the large-system limit is expected to be promising, until recently there were
almost no theoretical results as to the optimum user detection scheme. A notable
exception is the work by Tse and Verdú [5], in which they analyzed the performance
of the optimum user detection, but only in the zero-noise limit.

Performance evaluation for the case of finite noise has become possible by
applying tools and notions originally developed in statistical mechanics. In
particular, application of the replica method to the performance analysis of the
optimum user detection scheme has successfully resolved the difficulty mentioned
above, further providing insights into the problem [6]. It should also be noted
that approaches on the basis of statistical mechanics are effective not only for
theoretical performance evaluation but also for designing algorithms that operate
in nearly real time and at the same time come close to the theoretical performance
bounds. The issue of designing algorithms on the basis of statistical mechanics
will be discussed in [7].

In the following, we review the replica analysis of the optimum user detection
scheme for the CDMA communication. We first introduce the Bayesian
framework to the user detection problem. The receiver, acting as a student, assumes
a conditional distribution, parametrized by $b$, as

\[ p(y|s, b) = p\!\left(y - \frac{1}{\sqrt{N}} s^T b\right), \qquad (12) \]

where $p(n)$ denotes the student’s assumption about the channel noise distribution,
which may be different from the true noise distribution $p_0(n)$. The receiver
also has a postulated prior $p(b)$ of the parameter $b$, which is assumed to factorize
as $\prod_{k=1}^{K} p_k(b_k)$. One can then construct the posterior distribution of $b$
conditioned on $T = (S, y)$ as

\[ p(b|T) = \frac{p(y|S, b)\,p(b)}{\sum_b p(y|S, b)\,p(b)}, \qquad (13) \]

which is to be used to estimate the parameter $b$ based on $T$.

The basic quantity for evaluating the information transmission capability of
a CDMA channel is the mutual information $I(b; y)$ between the information
symbols $b$ and the received sequence $y$. The mutual information is decomposed
into two terms, as

\[ I(b; y) = H(y) - H(y|b), \qquad (14) \]

where $H(y)$ is the entropy of the received sequence $y$ and where $H(y|b)$ is the
conditional entropy of $y$ conditioned on the information symbols $b$. $H(y|b)$ is
nothing but the entropy of channel noise $n$,

\[ H(y|b) = N H(n) = -N \int p_0(n) \log p(n)\, dn, \qquad (15) \]



so that what remains is to evaluate $H(y)$. In view of the random spreading
assumption and the large-system limit, we instead evaluate the per-user entropy
averaged over $S$ in the limit $K \to \infty$:

\[ \mathcal{H} = \lim_{K \to \infty} \frac{1}{K} E_S\left[ H(y) \right]
   = -\lim_{K \to \infty} \frac{1}{K} E_S\left[ \int p_0(y|S) \log p(y|S)\, dy \right]. \qquad (16) \]



Application of the replica method yields the following proposition: 

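Proposition 1. The averaged per-user entropy in the large-system limit, \mathcal{H}, is 
given by 

    \mathcal{H} = -\frac{1}{\beta} \iint \tilde{p}_0\left(y - m\sqrt{\beta/q}\, t\right) \log \tilde{p}\left(y - \sqrt{\beta q}\, t\right) Dt\, dy
                  + \frac{1}{2} G r + E m - \frac{1}{2} F q
                  - E_{p_0,p}\left[\sum_{b_0} p_0(b_0) \int \log\left(\sum_{b} p(b) \exp\left[\frac{G - F}{2} b^2 + (\sqrt{F} z + E b_0) b\right]\right) Dz\right],    (17)

where \tilde{p}_0(n) and \tilde{p}(n) are defined as 

    \tilde{p}_0(n) = \int p_0\left(n - \sqrt{\beta (r_0 - m^2/q)}\, u\right) Du,    (18)

    \tilde{p}(n) = \int p\left(n - \sqrt{\beta (r - q)}\, u\right) Du,    (19)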
and where the values of {r_0, r, m, q, G, E, F} are to be determined by solving 
the RS saddle-point equations: 



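    r_0 = E_{p_0}\left[\sum_{b_0} p_0(b_0)\, b_0^2\right],
    r   = E_{p_0,p}\left[\sum_{b_0} p_0(b_0) \int \langle b^2 \rangle\, Dz\right],
    m   = E_{p_0,p}\left[\sum_{b_0} p_0(b_0)\, b_0 \int \langle b \rangle\, Dz\right],
    q   = E_{p_0,p}\left[\sum_{b_0} p_0(b_0) \int \langle b \rangle^2\, Dz\right],

    G = \iint \tilde{p}_0\left(y - m\sqrt{\beta/q}\, t\right) \frac{\tilde{p}''(y - \sqrt{\beta q}\, t)}{\tilde{p}(y - \sqrt{\beta q}\, t)}\, Dt\, dy,
    E = \iint \tilde{p}_0'\left(y - m\sqrt{\beta/q}\, t\right) \frac{\tilde{p}'(y - \sqrt{\beta q}\, t)}{\tilde{p}(y - \sqrt{\beta q}\, t)}\, Dt\, dy,
    F = \iint \tilde{p}_0\left(y - m\sqrt{\beta/q}\, t\right) \left(\frac{\tilde{p}'(y - \sqrt{\beta q}\, t)}{\tilde{p}(y - \sqrt{\beta q}\, t)}\right)^2 Dt\, dy.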



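Dz = (2\pi)^{-1/2} e^{-z^2/2} dz denotes the Gaussian measure. E_{p_0,p}(·) denotes taking 
the average over the distribution of {p_{0k}(b_k), p_k(b_k)}. The symbol ⟨···⟩ denotes an 
average of the form 

    \langle \cdots \rangle = \frac{\sum_{b} (\cdots)\, p(b) \exp\left[\frac{G - F}{2} b^2 + (\sqrt{F} z + E b_0) b\right]}{\sum_{b} p(b) \exp\left[\frac{G - F}{2} b^2 + (\sqrt{F} z + E b_0) b\right]}.    (20)

A derivation of the result is briefly described in the Appendix. 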

The replica analysis also leads to a quite interesting consequence: under 
the random spreading assumption and in the large-system limit, the optimum 
user detection scheme effectively reduces the problem to an equivalent 
single-user problem. The discussion in the following is an extension of that due to 
Guo and Verdú [8]. Let us consider a single-user problem in which the information 
symbol b is generated according to the true prior distribution p_0(b) and is 
transmitted via a Gaussian channel so that the received signal y is given by 

    y = \frac{E}{\sqrt{F}}\, b + z,    z \sim N(0, 1).    (21)

The signal-to-noise ratio (SNR) of the Gaussian channel is (E^2/F) \sum_b b^2 p_0(b). 
In the Bayesian framework the receiver is also equipped with a postulated prior 
as well as a postulated channel model. The postulated prior is denoted by \tilde{p}(b) 
and the postulated channel is again a Gaussian channel described by 

    y = \sqrt{F}\, b + z,    z \sim N(0, 1).    (22)
Furthermore, the signal-to-noise ratio (SNR) of the postulated Gaussian channel 
is F \sum_b b^2 \tilde{p}(b). The posterior mean estimator based on the postulated model is 
defined as 

    \langle b \rangle = \frac{\sum_{b} b\, \tilde{p}(b) \exp[-(y - \sqrt{F} b)^2/2]}{\sum_{b} \tilde{p}(b) \exp[-(y - \sqrt{F} b)^2/2]}.    (23)

Identifying y = (E/\sqrt{F}) b_0 + z in view of the true channel model (21), and 
\tilde{p}(b) = p(b) e^{G b^2/2}, one finds that the posterior mean (23) is equivalent to (20). 
This correspondence can be established in a more rigorous way, so that we have 
the following proposition: 

Proposition 2. Under the random spreading assumption and in the large-system 
limit, the optimum user detection scheme is equivalent, as far as one 
user out of the K users is concerned, to the estimation problem on a single-user 
Gaussian channel characterized by true and postulated priors, p_0(b) and 
\tilde{p}(b) = p(b) e^{G b^2/2}, as well as true and postulated Gaussian channel models, 
as described in (21) and (22), where p_0(b) and p(b) are the true and postulated 
priors for the user under consideration in the original CDMA channel model. 

It should be noted that the reduced single-user channel is Gaussian even though 
the channel noise distributions in the original CDMA channel model, p_0(n) and 
p(n), are not Gaussian. 
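
The posterior mean estimator (23) on the reduced single-user channel is easy to evaluate for a discrete prior. The sketch below is illustrative only: the function name `posterior_mean`, the finite symbol alphabet, and the way the factor e^{G b^2/2} is folded into the weights are assumptions made for the example, and the values of F and G would in practice come from the saddle-point equations.

```python
import numpy as np


def posterior_mean(y, symbols, prior, F, G=0.0):
    """Posterior mean of b for the postulated channel y = sqrt(F) * b + z.

    symbols and prior describe a discrete postulated prior p(b); the factor
    exp(G * b^2 / 2) turns it into the modified prior of Proposition 2.
    """
    symbols = np.asarray(symbols, dtype=float)
    prior = np.asarray(prior, dtype=float)
    log_w = (
        np.log(prior)
        + 0.5 * G * symbols**2
        - 0.5 * (y - np.sqrt(F) * symbols) ** 2
    )
    w = np.exp(log_w - log_w.max())  # unnormalized posterior weights
    return float(np.sum(symbols * w) / np.sum(w))
```

For equiprobable symbols ±1 the result reduces to tanh(sqrt(F) y), independently of G, since b^2 is the same for both symbols.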

As a simple corollary to Proposition 2, one obtains the following result, which 
reproduces a result reported by Guo and Verdú [9]: 

Corollary 1. Let p_{0k}(b_k) = p_k(b_k) = (1/2)(\delta(b_k - A_k) + \delta(b_k + A_k)), A_k > 0. 
Then the probability of error P in estimating b_k, the information symbol of user 
k, is given by 

    P = Q\left(\frac{E A_k}{\sqrt{F}}\right),    (24)

where 

    Q(z) = \int_{z}^{\infty} Dx.    (25)
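
The Gaussian tail function in (25) has a standard expression through the complementary error function, Q(z) = (1/2) erfc(z / sqrt(2)). A minimal Python helper (the name `gaussian_tail` is chosen here only for illustration) is:

```python
import math


def gaussian_tail(z):
    """Q(z): mass of the standard Gaussian measure on the interval [z, inf)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

By symmetry of the Gaussian measure, gaussian_tail(z) + gaussian_tail(-z) equals 1 for any z.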

An example of performance evaluation is shown in Fig. 2, in which the same 
Gaussian distribution N(0, \sigma^2) is assumed for both p_0(n) and p(n), and the 
probability of error P is plotted versus channel SNR E_b/N_0 = 1/(2\sigma^2). A nontrivial 
aspect observed in the result is that the performance curve becomes S-shaped 
as \beta increases. See [6] for discussion about the significance of this observation. 



Fig. 2. Probability of error P versus SNR E_b/N_0 [dB], evaluated by replica analysis. Results 
for \beta = 1, 1.2, 1.4, 1.6, 1.8, and 2 are shown. The dashed line represents the single-user bound. 



6 Summary 

We have shown that the user detection problem can be regarded as an instance 
of problems of “learning from examples” in statistical learning theory. Next, 
we have given a brief description of how the replica method of statistical 
mechanics can be applied to the analysis of the optimum user detection scheme for 
CDMA communications. Finally, we have mentioned the consequence, deduced 
from the replica analysis, that, under the random spreading assumption and in 
the large-system limit, the optimum user detection scheme effectively reduces the 
problem of user detection to a problem of estimating an information symbol 
in an equivalent single-user Gaussian channel. 
A Replica Analysis 

In this appendix we briefly describe how the replica calculation 
proceeds in evaluating \mathcal{H} as defined in (16). Our starting point is the identity 

    \mathcal{H} = -\lim_{n\to 0} \frac{\partial}{\partial n} \lim_{K\to\infty} \frac{1}{K} \log \Xi_n,    (26)

where we let 

    \Xi_n = E_S\left[\int p_0(y \mid S)\, [p(y \mid S)]^n\, dy\right].    (27)

First, we evaluate \lim_{K\to\infty} K^{-1} \log \Xi_n under the assumption that n is a 
positive integer. Under this assumption, one can rewrite the integrand of (27) as 

    p_0(y \mid S)\, [p(y \mid S)]^n = \sum_{b_0, \ldots, b_n} p_0(b_0)\, p_0(y \mid S, b_0) \prod_{a=1}^{n} [p(b_a)\, p(y \mid S, b_a)].    (28)

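Using this expression, \Xi_n is rewritten as 

    \Xi_n = \sum_{b_0, \ldots, b_n} p_0(b_0) \prod_{a=1}^{n} p(b_a) \int E_S\left[p_0(y \mid S, b_0) \prod_{a=1}^{n} p(y \mid S, b_a)\right] dy.    (29)

Since s^\mu, \mu = 1, ..., N, are i.i.d., the integral in (29) can be evaluated for each 
\mu separately, to yield 

    \int E_S\left[p_0(y \mid S, b_0) \prod_{a=1}^{n} p(y \mid S, b_a)\right] dy = e^{N \mathcal{G}},    (30)

where we let 

    e^{\mathcal{G}} = \int E_s\left[p_0(y \mid s, b_0) \prod_{a=1}^{n} p(y \mid s, b_a)\right] dy.    (31)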

Let us assume that e^{\mathcal{G}} depends on {b_a} only through an empirical mean 

    Q = \frac{1}{K} \sum_{k=1}^{K} Q(b_{0k}, b_{1k}, \ldots, b_{nk})    (32)

of a function Q(b_0, b_1, ..., b_n) (which is typically vector-valued, but may be 
an infinite-dimensional functional). This is indeed the case because, for fixed 
{b_0, ..., b_n} and under the random spreading assumption, the quantities 

    v_a = K^{-1/2} \sum_{k=1}^{K} s_k b_{ak},    a = 0, 1, \ldots, n,    (33)

can be regarded, in the limit N → ∞, as Gaussian random variables with 
E(v_a) = 0 and Cov(v_a, v_b) = K^{-1} \sum_k b_{ak} b_{bk}, so that one can take 

    Q(b_0, b_1, \ldots, b_n) =
        \begin{pmatrix}
            b_0^2   & b_0 b_1 & \cdots & b_0 b_n \\
            b_0 b_1 & b_1^2   & \cdots & b_1 b_n \\
            \vdots  & \vdots  & \ddots & \vdots  \\
            b_0 b_n & b_1 b_n & \cdots & b_n^2
        \end{pmatrix},    (34)

which makes Q the covariance matrix of the Gaussian random variables v = 
(v_0, v_1, ..., v_n)^T. The expectation with respect to s in (31) now corresponds to 
that with respect to the Gaussian random variables v, so that the result will be 
a function of Q, which is denoted by e^{\mathcal{G}(Q)}. The quantity \Xi_n is thus expressed 
as 

    \Xi_n = \int e^{K \beta^{-1} \mathcal{G}(Q)}\, \mu_K(Q)\, dQ,    (35)



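where 

    \mu_K(Q) = \sum_{b_0, \ldots, b_n} p_0(b_0) \prod_{a=1}^{n} p(b_a)\, \delta\left(\sum_{k=1}^{K} Q(b_{0k}, \ldots, b_{nk}) - K Q\right)    (36)

is a probability weight, with respect to the measure p_0(b_0) \prod_{a=1}^{n} p(b_a), of a 
subshell specified by Q, 

    S(Q) = \left\{ (b_0, \ldots, b_n) \;\middle|\; Q = K^{-1} \sum_{k=1}^{K} Q(b_{0k}, \ldots, b_{nk}) \right\}.    (37)

If the probability measure \mu_K(Q) follows the large-deviation principle in the 
limit K → ∞, with a rate function \mathcal{I}(Q), one can apply the saddle-point method, 
or Varadhan's theorem [10], to obtain 

    \lim_{K\to\infty} K^{-1} \log \Xi_n = \sup_{Q} [\beta^{-1} \mathcal{G}(Q) - \mathcal{I}(Q)].    (38)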

We further assume that the cumulant generating function, defined as the limit 

    \Lambda(\hat{Q}) = \lim_{K\to\infty} \frac{1}{K} \log \sum_{b_0, \ldots, b_n} p_0(b_0) \prod_{a=1}^{n} p(b_a) \exp\left[\hat{Q} \cdot \sum_{k=1}^{K} Q(b_{0k}, \ldots, b_{nk})\right],    (39)

exists. Then, from the Gärtner-Ellis theorem [10], the rate function \mathcal{I}(Q) is given 
by the Fenchel-Legendre transform of \Lambda(·), as 

    \mathcal{I}(Q) = \sup_{\hat{Q}} [Q \cdot \hat{Q} - \Lambda(\hat{Q})].    (40)

Summarizing these results, one obtains 

    \lim_{K\to\infty} K^{-1} \log \Xi_n = \sup_{Q} \inf_{\hat{Q}} [\beta^{-1} \mathcal{G}(Q) - Q \cdot \hat{Q} + \Lambda(\hat{Q})].    (41)
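
To make the Fenchel-Legendre transform in (40) concrete, the following sketch evaluates it numerically in the simplest scalar setting: the empirical mean of independent symbols taking values ±1 with equal probability, for which the cumulant generating function is log cosh and the rate function is known in closed form. This toy case and the names `rate_function` and `exact_rate` are assumptions made for illustration; they are not the matrix-valued Q of the appendix.

```python
import numpy as np


def rate_function(q, q_hat_grid):
    """Numerical Fenchel-Legendre transform sup_{q_hat} [q * q_hat - Lambda(q_hat)].

    Lambda(q_hat) = log cosh(q_hat) is the cumulant generating function of a
    single fair +/-1 symbol (toy example, scalar q).
    """
    cumulant = np.log(np.cosh(q_hat_grid))
    return float(np.max(q * q_hat_grid - cumulant))


def exact_rate(q):
    """Closed-form rate function for the empirical mean of fair +/-1 symbols."""
    return 0.5 * (1 + q) * np.log(1 + q) + 0.5 * (1 - q) * np.log(1 - q)
```

A grid search over the conjugate variable is enough here because the objective is concave in q_hat; in the matrix-valued case of the appendix the supremum is instead characterized by the saddle-point equations.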



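The saddle-point equations, determining a saddle-point solution (Q, \hat{Q}), are 
given by 

    \hat{Q} = \beta^{-1} \frac{\partial \mathcal{G}}{\partial Q},    Q = \frac{\partial \Lambda}{\partial \hat{Q}}.    (42)

Let 

    \mathcal{R}(y, v) = p_0(y - \sqrt{\beta}\, v_0) \prod_{a=1}^{n} p(y - \sqrt{\beta}\, v_a),    (43)

then one can write \mathcal{G}(Q) as 

    e^{\mathcal{G}(Q)} = \int E_v[\mathcal{R}(y, v)]\, dy,    (44)

where E_v[·] denotes taking the average over the Gaussian random variable v ~ N(0, Q). 
Defining operators D_a, a = 0, ..., n, as 

    D_0 \mathcal{R}(y, v) = \frac{p_0'(y - \sqrt{\beta}\, v_0)}{p_0(y - \sqrt{\beta}\, v_0)}\, \mathcal{R}(y, v),
    D_a \mathcal{R}(y, v) = \frac{p'(y - \sqrt{\beta}\, v_a)}{p(y - \sqrt{\beta}\, v_a)}\, \mathcal{R}(y, v),    a = 1, \ldots, n,    (45)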

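one obtains, as saddle-point equations, 

    \hat{Q}_{aa} = \frac{e^{-\mathcal{G}}}{2} \int E_v[D_a^2 \mathcal{R}(y, v)]\, dy,
    \hat{Q}_{ab} = e^{-\mathcal{G}} \int E_v[D_a D_b \mathcal{R}(y, v)]\, dy,    a \ne b.    (46)

Another set of saddle-point equations is 

    Q_{ab} = E_{p_0,p}[\langle\langle b_a b_b \rangle\rangle],    (47)

where 

    \langle\langle \cdots \rangle\rangle = \frac{\sum_{b_0, \ldots, b_n} (\cdots)\, p_0(b_0) \prod_{a=1}^{n} p(b_a)\, e^{\hat{Q} \cdot Q(b_0, \ldots, b_n)}}{\sum_{b_0, \ldots, b_n} p_0(b_0) \prod_{a=1}^{n} p(b_a)\, e^{\hat{Q} \cdot Q(b_0, \ldots, b_n)}},    (48)

and where E_{p_0,p}(·) denotes taking the average over the distribution of {p_{0k}(b_k), p_k(b_k)}. 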
To proceed further, we assume replica symmetry (RS), which amounts to 
assuming that the saddle-point solutions are invariant under exchange of replica 
indexes a = 1, ..., n. Specifically, we let 

    Q_{00} = r_0,    Q_{aa} = r,    Q_{0a} = m,    Q_{ab} = q,
    \hat{Q}_{00} = \frac{G_0}{2},    \hat{Q}_{aa} = \frac{G}{2},    \hat{Q}_{0a} = E,    \hat{Q}_{ab} = F,    (49)

where a, b ≠ 0 and a ≠ b. Under the RS assumption, the following representation 
can be introduced to define the Gaussian random variable v with the desired 
covariance structure: 

    v_0 = \sqrt{r_0 - m^2/q}\, u_0 + \frac{m}{\sqrt{q}}\, t,
    v_a = \sqrt{r - q}\, u_a + \sqrt{q}\, t,    a = 1, \ldots, n,    (50)

where t and u_a, a = 0, ..., n, are standard Gaussian random variables. Introducing 
\tilde{p}_0(n) and \tilde{p}(n) as defined in (18) and (19), one obtains 

    e^{\mathcal{G}(Q)} = \iint \tilde{p}_0\left(y - m\sqrt{\beta/q}\, t\right) \left[\tilde{p}\left(y - \sqrt{\beta q}\, t\right)\right]^n Dt\, dy.    (51)
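
The role of the representation (50) is only to realize the RS covariance structure (49) with independent standard Gaussians. The sketch below checks this by forming the linear map from (u_0, ..., u_n, t) to (v_0, ..., v_n) and computing its covariance analytically. The form used for v_0 is the one written in (50), and the function name `rs_covariance` is hypothetical.

```python
import numpy as np


def rs_covariance(r0, r, m, q, n):
    """Covariance matrix of (v_0, ..., v_n) under the representation (50).

    Columns of the loading matrix correspond to the independent standard
    Gaussians u_0, ..., u_n and the shared variable t (last column).
    """
    loading = np.zeros((n + 1, n + 2))
    loading[0, 0] = np.sqrt(r0 - m**2 / q)
    loading[0, -1] = m / np.sqrt(q)
    for a in range(1, n + 1):
        loading[a, a] = np.sqrt(r - q)
        loading[a, -1] = np.sqrt(q)
    # Cov(v) = L L^T because the inputs are independent with unit variance.
    return loading @ loading.T
```

The resulting matrix has r_0 and r on the diagonal, m between v_0 and each replica, and q between distinct replicas, provided r_0 ≥ m^2/q and r ≥ q so that the square roots are real.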



The cumulant generating function \Lambda(\hat{Q}) is evaluated under the RS assumption 
by making use of the so-called Hubbard-Stratonovich transform to linearize the 
exponents. As a result, one has 

    \Lambda(\hat{Q}) = E_{p_0,p}\left[\log \sum_{b_0} p_0(b_0)\, e^{G_0 b_0^2/2} \int \left(\sum_{b} p(b)\, e^{(G - F) b^2/2 + (\sqrt{F} z + E b_0) b}\right)^n Dz\right].    (52)

Putting these results together, and then taking the derivative with respect to n and 
the limit n → 0, we arrive at the final result, as summarized in Proposition 1. 
Abstract. Although the Bayesian approach provides optimal performance 
for many inference problems, the computation cost is sometimes 
impractical. We herein develop a practical algorithm by which to 
approximate Bayesian inference in large single-layer feed-forward networks 
(perceptrons) based on belief propagation (BP). Although direct 
application of BP to the inference problem remains computationally difficult, 
by introducing methods and concepts from statistical mechanics that are 
related to the central limit theorem and the law of large numbers, the 
proposed BP-based algorithm exhibits nearly optimal performance in a 
practical time scale for ideal large networks. In order to demonstrate 
the practical significance of the proposed algorithm, an application to a 
problem that arises in a mobile communications system is also presented. 



1 Introduction 



Learning and inference in probabilistic models are important in research on 
machine learning and artificial intelligence. In a general scenario, a probabilistic 
relationship between an N-dimensional input vector x and an output y, which may 
be discrete or continuous, and sometimes multi-dimensional, is expressed by a 
conditional distribution P(y|x, w), where w denotes a set of adjustable parameters. 
Prior knowledge concerning w may be represented by a prior distribution 
P(w). Under such circumstances, a learner is required either to infer the machine 
parameter w or to predict the output y^{p+1} of the (p+1)-th input x^{p+1} based on 
a set of p training data (or examples) D^p = {(x^1, y^1), (x^2, y^2), ..., (x^p, y^p)}. 

The Bayesian formula for evaluating the posterior distribution 

    P(w \mid D^p) = \frac{P(D^p \mid w)\, P(w)}{P(D^p)} = \frac{P(w) \prod_{\mu=1}^{p} P(y^\mu \mid x^\mu, w)}{\int dw\, P(w) \prod_{\mu=1}^{p} P(y^\mu \mid x^\mu, w)}    (1)

is important in such tasks. If the goal of learning is to accurately estimate 
the machine parameter w, then the optimal estimator \hat{w}^{Bayes} can be designed 
for a given loss function L(\hat{w}, w), which represents the deviation between w and 
its estimate \hat{w}, so as to minimize the expected loss as 

    \hat{w}^{Bayes} = \mathrm{argmin}_{\hat{w}} \left\{ \int dw\, P(w \mid D^p)\, L(\hat{w}, w) \right\},    (2)
utilizing the posterior distribution (1) [1]. Predicting the output for a new 
input is another possible goal of learning. In such cases, a loss function 
L(y^{p+1}, \hat{y}) is defined for y^{p+1} and its estimate \hat{y}, to which the machine 
parameter w is not directly related. However, the optimal predictor \hat{y}^{Bayes} can also be 
constructed by minimizing the expected loss as 

    \hat{y}^{Bayes} = \mathrm{argmin}_{\hat{y}} \left\{ \int dy^{p+1}\, P(y^{p+1} \mid x^{p+1}, D^p)\, L(y^{p+1}, \hat{y}) \right\},    (3)

where the Bayesian predictive distribution [2] 

    P(y^{p+1} \mid x^{p+1}, D^p) = \int dw\, P(w \mid D^p)\, P(y^{p+1} \mid x^{p+1}, w)    (4)

is evaluated from the posterior distribution (1) as well. 
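
The decision rules (2) and (3) can be illustrated on a discretized one-dimensional posterior: given posterior weights on a grid of candidate values, the Bayes estimate is the candidate with the smallest posterior-expected loss. The sketch below is a toy example only; restricting the candidates to the same grid as the posterior support, and the name `bayes_estimate`, are assumptions made for illustration.

```python
import numpy as np


def bayes_estimate(values, posterior, loss):
    """Return the grid value minimizing the posterior-expected loss.

    values:    1-D grid of possible parameter (or output) values
    posterior: probabilities on that grid, summing to one
    loss:      function loss(candidate, values) returning elementwise losses
    """
    values = np.asarray(values, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    expected = [np.sum(posterior * loss(c, values)) for c in values]
    return float(values[int(np.argmin(expected))])
```

With the squared loss, the minimizer is the grid point closest to the posterior mean, which is the discrete counterpart of the well-known fact that the posterior mean is the Bayes estimator under squared error.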

Equations (2)-(4) indicate that performing averages with respect to the 
posterior (1) is fundamental in such optimal strategies. Unfortunately, this becomes 
computationally difficult as the dimensionality of w increases. Therefore, 
despite being recognized as optimal, the Bayesian approach was previously 
considered to be impractical except for small models that contain only a few 
parameters. However, recent cross-disciplinary research in areas such as 
machine learning [3,4,5,6], error correcting codes [7,8,9], and wireless 
communications [10,11] has revealed that methods from statistical mechanics (SM) can 
be useful in developing efficient approximation algorithms for computing 
averages and that the difficulty associated with the application of Bayesian inference 
can be practically overcome for several applications [12]. 

The purpose of the present study is to demonstrate one example of such 
an application. More specifically, we develop an efficient algorithm that 
approximates Bayesian inference on a practical time scale for single-layer 
feed-forward networks (perceptrons). The algorithm is based on belief propagation 
(BP), which was developed while researching graphically represented 
probabilistic models [14]. Belief propagation can be carried out on a practical time scale 
when the posterior distribution is pictorially represented by a sparse graph. 
Unfortunately, the posterior distribution of the learning problem of perceptrons 
corresponds to a dense graph, which implies that the direct application of BP 
is still computationally difficult. However, we demonstrate herein that this 
difficulty can be overcome using methods and concepts from SM and that the 
performance of the proposed BP-based algorithm is nearly optimal for ideal cases 
of large networks. 

The present paper is organized as follows. In the next section, we introduce 
the learning problem for perceptrons that will be considered in the present study. 
In Section 3, the proposed learning algorithm based on BP is developed using 
techniques from SM. The properties of the proposed algorithm are also examined 
in this section. In Section 4, an application to a problem that arises in a mobile 
communications system is presented in order to demonstrate the significance of 
the proposed algorithm to real-world problems. The final section summarizes the 
present study. 
2 Learning of Perceptrons 

In the present study, we focus on a learning machine (probabilistic model) of the 
single-layer perceptron type that is specified by a conditional distribution 

    P(y \mid x, w) = g(y \mid \Delta),    \Delta = \frac{w \cdot x}{\sqrt{N}},    (5)

where w = (w_1, w_2, ..., w_N) denotes the adjustable machine parameters. The 
parameters can be either continuous or discrete. However, in the present study 
we assume a factorized prior distribution 

    P(w) = \prod_{i=1}^{N} P(w_i)    (6)

for computational tractability. For simplicity, we also assume that the input 
vectors x^\mu (\mu = 1, 2, ..., p) of the training set are independently generated 
from an identical uniform and uncorrelated distribution that is characterized by 
\overline{x_i} = 0 and \overline{x_i x_j} = \delta_{ij}, where \overline{(\cdots)} denotes the average of (···) over a certain 
distribution of examples. Extension to correlated distributions that have 
non-diagonal covariance matrices is possible at the expense of a slight increase in 
computational cost [4,16,17]. 

In many cases, the averages relevant to the optimal inference strategy can be 
easily computed from the marginal posterior distribution 

    P(w_i \mid D^p) = \int \prod_{j \ne i} dw_j\, P(w \mid D^p).    (7)

Note that the evaluation of this distribution is still computationally difficult 
even though several assumptions have been introduced. Therefore, we develop an 
algorithm by which to efficiently evaluate Eq. (7). For this task, we select belief 
propagation (BP) [14], or the sum-product algorithm [15], as a promising 
candidate, since BP exhibits excellent performance in several applications [7,8,9,18]. 
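
To see why the marginal (7) is costly to evaluate directly, the following sketch computes it by brute-force enumeration for a very small network. Several ingredients are assumptions made only for this illustration and are not taken from the text: binary parameters w_i in {-1, +1} with a uniform prior, binary outputs, a probit-type likelihood `probit_likelihood` with noise level `sigma`, and the N^{-1/2} normalization of the field.

```python
import itertools
import math

import numpy as np


def probit_likelihood(y, delta, sigma=0.5):
    """Hypothetical g(y | delta): probability of label y = +/-1 given the field."""
    return 0.5 * math.erfc(-y * delta / (sigma * math.sqrt(2.0)))


def exact_marginals(X, y, likelihood=probit_likelihood):
    """P(w_i = +1 | D) by enumerating all 2^N binary parameter vectors.

    X is a p x N matrix of inputs, y a length-p vector of +/-1 outputs.
    """
    p, N = X.shape
    configs = list(itertools.product([-1.0, 1.0], repeat=N))
    weights = []
    for w in configs:
        deltas = X @ np.array(w) / math.sqrt(N)
        weights.append(math.prod(likelihood(yi, di) for yi, di in zip(y, deltas)))
    weights = np.array(weights) / np.sum(weights)  # posterior over configurations
    configs = np.array(configs)
    # Sum the posterior mass of all configurations with w_i = +1.
    return ((configs + 1.0) / 2.0 * weights[:, None]).sum(axis=0)
```

The loop visits 2^N configurations, so this approach is limited to N of a few tens at most, which is the computational difficulty that motivates the BP-based approximation.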

3 A BP-Based Algorithm 

3.1 Graphical Representation 

For the application of BP to the present learning problem, we graphically 
represent the dependences of w on the training set in the posterior distribution 
(1), as shown in Fig. 1 (a). In this figure, each component of the parameters w_i 
(i = 1, 2, ..., N) is denoted by a node (circle) and each output of the training 
examples y^\mu (\mu = 1, 2, ..., p) is represented by a different type of node (square). 
The components of the input vectors x_i^\mu are represented by lines, since w_i and y^\mu are 
directly related through Eq. (5). Since, generally, none of the components of the 
individual input vectors vanishes, this construction scheme implies that the posterior 
distribution of the current perceptron learning is represented as a complete bipartite 
graph. 

Fig. 1. Graphical representations of Bayesian inference problems. In the above graphs, 
variables to be estimated are denoted as circles and observed data are denoted as 
squares. Each line represents a direct relation between the unobserved variable and the 
data. (a) The inference problem in a perceptron is represented by a complete bipartite 
graph, the lines of which correspond to the components of input vectors x_i^\mu. (b) A 
graph containing loops. (c) A loop-free graph. 



3.2 Direct Application of BP and Computational Difficulty 

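In Fig. 1, BP is expressed as an iterative algorithm for computing Eq. (7) 
by passing certain positive functions m_{i\to\mu}(w_i) and \hat{m}_{\mu\to i}(w_i), which are often 
termed messages, between the two types of nodes [14]. In the current system, 
the messages are updated as 

    \hat{m}^{t+1}_{\mu\to i}(w_i) = \hat{\alpha}_{\mu i} \int \prod_{j \ne i} dw_j\, g(y^\mu \mid \Delta^\mu) \prod_{j \ne i} m^{t}_{j\to\mu}(w_j),    (8)

    m^{t}_{i\to\mu}(w_i) = \alpha_{i\mu}\, P(w_i) \prod_{\nu \ne \mu} \hat{m}^{t}_{\nu\to i}(w_i),    (9)

where \Delta^\mu = w \cdot x^\mu / \sqrt{N} and t = 0, 1, 2, ... is the number of updates. Here, \hat{\alpha}_{\mu i} and 
\alpha_{i\mu} are constants such that the normalization condition \int dw_i\, \hat{m}^{t}_{\mu\to i}(w_i) = 
\int dw_i\, m^{t}_{i\to\mu}(w_i) = 1 holds. Rather than a sequential update (in which only a 
single variable is changed per single update), we employ a parallel update (in 
which all of the variables are changed for each update). Using the messages 
\hat{m}^{t}_{\mu\to i}(w_i), the marginal posterior distribution (7) is approximately evaluated as 

    P(w_i \mid D^p) \simeq \alpha_i\, P(w_i) \prod_{\mu=1}^{p} \hat{m}^{t}_{\mu\to i}(w_i)    (10)

at the t-th update. 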

If a graph is free from loops, then iteration of Eqs. (8) and (9) leads to the 
exact marginals (7) [14,8] (see Fig. 1 (b) and (c)). The required computational 
cost is practically tractable because the number of variable nodes linked to each 
example node is O(1) and the evaluation of Eq. (8) is the most time-consuming 
task in BP. On the other hand, for general loopy graphs, BP can work as an 
approximation algorithm, although the convergence properties of Eqs. (8) and (9) 
have not yet been fully clarified. However, excellent experimental approximation 
performance has been reported for randomly constructed sparse graphs [7,8,18,19], 
for which the BP updates are computationally feasible as well. This result 
may appear reasonable since the typical length of loops in randomly constructed 
sparse graphs diverges as O(ln N) with respect to the number of nodes N [20], 
and the properties of such graphs are therefore expected to approach those of 
loop-free graphs when N is large. 

In contrast, dense graphs, including the complete bipartite graph of the 
present system, contain many short loops, which may cause the properties of 
the system to be very different from those of loop-free graphs. Furthermore, as 
all of the parameter nodes are connected to each example node, the evaluation of 
Eq. (8) is computationally difficult. As a result, the present attempt to develop 
an approximation algorithm based on BP does not appear to be promising for 
the current learning problem of perceptrons. 



3.3 Gaussian Approximation and Self-Averaging Property 

However, the computational difficulty of BP in the present problem when the 
size of network N is large can be resolved by applying methods and concepts 
from SM to obtain an algorithm that exhibits a nearly optimal performance 
with practical computational cost for ideal cases. This approach is similar to 
that introduced for characterization of the equilibrium state in [3,4,21]. Here, we 
show that a similar scheme is also useful for developing an efficient algorithm 
having a low computational cost that seeks the equilibrium state. 

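We first expand the conditional distribution of each example \mu up to the first 
order of w_i as 

    g(y^\mu \mid \Delta^\mu) \simeq g(y^\mu \mid \Delta^\mu_i) + g'(y^\mu \mid \Delta^\mu_i)\, \frac{x_i^\mu w_i}{\sqrt{N}},    (11)

where \Delta^\mu_i = N^{-1/2} \sum_{j \ne i} x_j^\mu w_j. Next, we note that the 
integral over w_{j \ne i} in Eq. (8) has an intuitive implication of the average with 
respect to w_{j \ne i} over a factorial multi-dimensional distribution \prod_{j \ne i} m^{t}_{j\to\mu}(w_j). 
Equation (9) indicates that m^{t}_{j\to\mu}(w_j) can be regarded as a marginal 
distribution that is determined by a data set from which (x^\mu, y^\mu) is absent. Since 
inputs are assumed to be generated independently of each other, w_{j \ne i} is 
typically not correlated with x^\mu when w_{j \ne i} is generated from \prod_{j \ne i} m^{t}_{j\to\mu}(w_j); 
therefore, \Delta^\mu_i can be treated as a Gaussian random variable due to the central 
limit theorem. The average of the Gaussian variable is obtained as 
\langle \Delta^\mu_i \rangle^t_\mu = N^{-1/2} \sum_{j \ne i} x_j^\mu \langle w_j \rangle^t_\mu, where \langle \cdots \rangle^t_\mu represents the average of (···) with respect 
to the factorized probability \prod_{j \ne i} m^{t}_{j\to\mu}(w_j). On the other hand, evaluating the 
variance N^{-1} \sum_{j,k \ne i} x_j^\mu x_k^\mu (\langle w_j w_k \rangle^t_\mu - \langle w_j \rangle^t_\mu \langle w_k \rangle^t_\mu) is 
nontrivial. However, research in the area of statistical mechanics of disordered 
systems indicates that such variance converges to its expectation with respect 
to the sample in various systems as 

    \frac{1}{N} \sum_{j,k \ne i} x_j^\mu x_k^\mu \left(\langle w_j w_k \rangle^t_\mu - \langle w_j \rangle^t_\mu \langle w_k \rangle^t_\mu\right) \simeq \frac{1}{N} \sum_{i=1}^{N} \left(\langle w_i^2 \rangle^t_\mu - (\langle w_i \rangle^t_\mu)^2\right)    (12)

in the ‘thermodynamic limit’ N, p → ∞ with \alpha = p/N fixed, which is referred 
to as the self-averaging property [3,4,21]. Therefore, we employ this property 
as an approximation for sufficiently large networks and further approximate the 
variance as S^t, where 

    S^t = \frac{1}{N} \sum_{i=1}^{N} \left(\langle\langle w_i^2 \rangle\rangle^t - (\langle\langle w_i \rangle\rangle^t)^2\right),    (13)

neglecting the O(N^{-1}) difference, where \langle\langle \cdots \rangle\rangle^t denotes the average of (···) 
with respect to Eq. (10). 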

Substituting Eq. (11) into Eq. (8) and replacing the multi-dimensional inte- 
gral with the average over the single Gaussian variable yields 



oc 1 -b 



9 In 



0 (y^\{A^^)\-S^) 



ai"[g(«'^i<^i‘)^g*)] 



xfwj 

yiv 






Vn 



_ 

= e 



(14) 



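where 

    \hat{g}(y^\mu \mid \langle \Delta^\mu_i \rangle^t_\mu ; S^t) = \int Dz\, g\left(y^\mu \mid \langle \Delta^\mu_i \rangle^t_\mu + \sqrt{S^t}\, z\right),

and Dz = (2\pi)^{-1/2} e^{-z^2/2} dz. Inserting Eq. (15) into Eq. (9) yields 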



ml^ Jwi) oc 



where is obtained as 






= V h*, = 



d\n[g (y’^\{A-)l-,S^) 









dm: 



(15) 



(16) 



(17) 



for each update t. Finally, the approximate marginal posterior distribution (10) 
is evaluated as 

P{w,\DP) = Z-Hhl)P{wi)e’^*^ 



where 



(18) 
fi^i 



= E 



a In 



^1 






(19) 



and the normalization factor 

Z{hl) = J dw,P{w,)e^"^% 



(20) 



can be used to evaluate the approximate posterior average (iCi)* and macroscopic 
parameter S'* as 



and 



respectively. 



(Wi)* 



dln[Z{h^)] 

dh\ ’ 



1 ^ a^ln[Z(/i*)] 

mf 



( 21 ) 



(22) 
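
The Gaussian-smoothed likelihood used throughout this section, i.e. the average of g over a Gaussian field with a given mean and variance S, is a one-dimensional integral and can be evaluated by Gauss-Hermite quadrature. The sketch below is an illustration rather than the authors' implementation: the probit form of g, its noise level `sigma`, and the function names are assumptions. It relies on NumPy's `hermegauss`, which returns nodes and weights for the weight function exp(-z^2/2).

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def smoothed_likelihood(g, y, mean_delta, S, order=60):
    """Approximate int Dz g(y | mean_delta + sqrt(S) * z) by quadrature."""
    nodes, weights = hermegauss(order)
    values = np.array([g(y, mean_delta + math.sqrt(S) * z) for z in nodes])
    # hermegauss weights integrate against exp(-z^2 / 2); divide by sqrt(2 pi)
    # to obtain the average over the Gaussian measure Dz.
    return float(np.sum(weights * values) / math.sqrt(2.0 * math.pi))


def probit(y, delta, sigma=0.5):
    """Hypothetical likelihood g(y | delta) for binary outputs y = +/-1."""
    return 0.5 * math.erfc(-y * delta / (sigma * math.sqrt(2.0)))
```

For the probit choice the smoothed likelihood is again a probit with variance sigma^2 + S, which provides a convenient closed-form check of the quadrature.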



3.4 Further Reduction of Computational Cost 

Equations (14) and (16) constitute the fundamental component of the 
approximation algorithm. The necessary cost for computing these equations is O(N) 
and O(p) for each pair of indices i and \mu, which means that O(N^3) computations 
are required for each update in the ‘thermodynamic situation’ in which 
\alpha = p/N is O(1). Although this may already be practically tractable, the 
computational cost can be reduced further. 

Here, we use the perturbation expansion 




(Wi)* 



{dhf)^ " 



(23) 



which is derived from Eqs. (17), (19) and (21). In addition, this implies 



(24) 

where ' I^iserting these into Eqs. (14) and (21) and 

omitting the negligible terms for large N and p yields update rules for (wi)* and 
of, as 



a 



t+i ^ 



a In 






y^\{A^Y-^^a' 






d{A^y 




(25) 



and 
(wi)* = 



9 In 



z (el - 




(26) 



respectively, where 0* = these equations, the macroscopic vari- 

ables S'* and are evaluated as 



and 



= -y 

N ^ 



N 



z el- r 



t-i 



(Wi)* 



Z=1 



(deif 



1 



r* = — V 

N ^ 

fi=i 



P 9^1n 



(j 



(APy - S‘a* ; S* 






(27) 



(28) 



respectively, from \langle\langle w_i \rangle\rangle^t and a^t_\mu. Equations (25)-(28) denote the general 
expression of the above-developed approximation algorithm. 

We would now like to discuss a number of important points. The first 
concerns the cost of computation. Equations (25) and (26), which are the heart of 
the developed algorithm, can be performed with O(N) and O(p) computations 
for each index \mu or i, respectively, which indicates that the necessary 
computational cost per update is reduced from O(N^3) to O(N^2). This is because the 
number of variables to handle can be reduced from O(N^2) to O(N) by using 
the perturbation expansions (23) and (24) for large systems. The second point 
of discussion is the link to the Bayesian inference. The Bayesian predictive 
distribution (4) for a novel (p+1)-th example can be computed as 

    P(y^{p+1} \mid x^{p+1}, D^p) = \int Dz\, g\left(y^{p+1} \mid \langle\langle \Delta^{p+1} \rangle\rangle^t + \sqrt{S^t}\, z\right)
                                 = \hat{g}(y^{p+1} \mid \langle\langle \Delta^{p+1} \rangle\rangle^t ; S^t),    (29)

from the estimates at the t-th update, which indicates how the developed 
algorithm can be utilized for high-performance inference. The third point of 
discussion is the extensibility. The developed algorithm can be extended to the cases of 
the correlated distribution in which the covariance matrix C = (\overline{x_i x_j} - \overline{x_i}\,\overline{x_j}) is 
not diagonal. In such cases, a similar algorithm is obtained for continuous 
parameters by modifying the factorized prior distribution (6) to a correlated Gaussian 
distribution P{w) = (27r)“*^/^(det (F > 0), as proposed 
in [4] while the linear response relation {wiWjy — {wi)* (wj)* = [16,17] 

j 

offers another possibility of extension for other priors. Finally, as mentioned 
above, in developing the current algorithm, we have assumed that the 
distribution (covariance) of input vectors is known. In addition, we have utilized the 
self-averaging property under the assumption that the size of the network is 
sufficiently large. However, these assumptions are not necessarily both satisfied for 
real-world problems, for which the developed algorithm may not exhibit a 
sufficient approximation performance. For such cases, an alternative scheme termed 
the adaptive TAP approach has been developed [5]. This alternative scheme can 
be adapted to the given examples without knowledge of the properties of the 
background distribution from which the examples are generated. In addition, 
this scheme shares the same equilibrium solution as the current algorithm when 
the above two assumptions are satisfied. However, since an efficient solution scheme has 
not yet been fully examined, the necessary computational cost of this method is 
typically O(N^3), which may limit its applicability to relatively small networks. 
Therefore, the present algorithm of computational cost O(N^2) is still competitive 
for large networks. 



3.5 Macroscopic Analysis in a Teacher-Student Scenario 

In order to examine the properties of the proposed BP-based algorithm, let us 
assume a teacher-student scenario in which each output y^\mu (\mu = 1, 2, ..., p) is 
labeled by a teacher network that is a perceptron-type probabilistic model (5) 
of a true parameter w^0 = (w^0_1, w^0_2, ..., w^0_N) (|w^0|^2 = N). In such situations, the 
properties of the algorithm can be analyzed by monitoring several macroscopic 
variables. For comparison with the existing results from SM [22], we assume a 

factorizable Gaussian prior P{wi) = where F is adaptively deter- 

mined from the spherical constraint 

For the macroscopic analysis, we assume that signal/noise separation is pos- 
sible in as 




R* 0 
- — < 
P 




(30) 



where i?* and Q* represent the signal and noise strengths, respectively, and z^i 
is modeled as an independent normal Gaussian random number for each pair 
of indices p, and i. Inserting Eq. (30) into Eq. (16) and neglecting statistical 
fluctuations and small differences yields /N ~ ■ (w)* /N = R* and 

I P/fV ~ I (w)* P/A = Q* hold for p = 1,2, .. . ,p, and the macroscopic 
variables i?* and Q* can be evaluated as 



= 



R* 






+ {R^f 



(31) 

(32) 



where F* = | + y 1 + 4(Q‘ + (-R*)^) J is obtained from the spherical con- 

straint. This constraint also yields the relation S'* = 1 — Q*. 
Next, we examine the update of the strengths R^t and Q^t. Equations (16) and 
(30), in conjunction with the self-averaging property, indicate that R^t and Q^t 
are updated as 



P r 

^ E ; oE' ^ “ / dy^9(y>^\A[;)A[; 

a=l Ei=lK“)" J 



9 In 



g{yP\{A^^f^■S*) 



d{AP)\ 



,(33) 



g‘+i ^ ^ :^a dy^g{y^\A[;) 

fi=l d 



9 In 



g{yp\{Af^f -s^) 



d{AP)\ 




where ^ {^iY = 1- addition, statistical 

fluctuations and small differences between {Af)^ and {AP') ^ were neglected. In 
order to evaluate Eqs. (33) and (34), let us suppose that the relative position of 
(m)^ to is characterized by i?* and Q*. In such cases, the central limit theorem 
implies that {Ap)*^ = {w)^-xP and Z\q = -xP can be regarded as 

zero-mean correlated Gaussian random numbers, the second moments of which 
are obtained as 



[{AP)l)" = Q\ A^,{AP)l = R\ {A^,f = l, (35) 

because each component of xP is generated from a distribution of the zero mean 
and unit variance, independent of other inputs x'^'^p . This allows us to represent 
these variables as 



{AP)l = y^^u„ = 



Q 



t 



w 

7^ 






(36) 



using independent normal Gaussian random numbers u_\mu and v_\mu. Next, by 
applying the formula \int Dz\, z f(z) = \int Dz\, f'(z), we obtain 



i?‘+i = 



dQ y\ 






dyDu- 









a In 



g{y\^u;S*) 



du 






du 



a In 



Q = 7^ dyDuG y\^=u;^ 



g{y\^*u; S*) 



Q 






du 



(37) 



, (38) 



where r;* = ^ gt , which, in conjunction with Eqs. (31) and (32), describe the 
macroscopic behavior of the BP dynamics. 
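
The Gaussian integration-by-parts formula used in this derivation, namely that the Gaussian average of z f(z) equals the Gaussian average of f'(z), can be verified numerically for smooth test functions. The sketch below uses Gauss-Hermite quadrature from NumPy; the helper name `gaussian_average` and the test functions are chosen here purely for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gaussian_average(f, order=80):
    """Approximate int Dz f(z) for z ~ N(0, 1) by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(order)
    # The weights integrate against exp(-z^2 / 2), hence the sqrt(2 pi).
    return float(np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi))


def stein_gap(f, f_prime):
    """Difference between the two sides of int Dz z f(z) = int Dz f'(z)."""
    lhs = gaussian_average(lambda z: z * f(z))
    rhs = gaussian_average(f_prime)
    return lhs - rhs
```

For example, with f(z) = z^3 both sides equal 3, and with f(z) = tanh(0.7 z + 0.2) the gap is at the level of quadrature error.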

Note that these equations are simply the natural iteration of the saddle-point 
equations of the replica analysis under the replica symmetric (RS) ansatz 
for the present teacher-student problem [22]. Since it is widely believed that 
the RS solution correctly predicts the behavior of the exact solution in the 
thermodynamic limit if the architectures of the teacher and student are of the 
same type [13], the proposed algorithm is thought to provide a solution very 
close to the exact solution when the size of the network is sufficiently large. 

Fig. 2. (a) Trajectories of the overlap R^t. Symbols and error bars represent the average 
and standard deviation, respectively, obtained from 100 experiments for N = 100 
and p = 200. The horizontal axis indicates the number of updates t. The lines were 
obtained by Eqs. (31), (32), (37) and (38). (b) Prediction error (the probability that the 
predicted output differs from y^{p+1}) obtained for the Bayesian inference based on Eq. (29), 
plotted with respect to \alpha = p/N for N = 100 and varying p. Symbols and error bars are 
based on the results of 100 experiments, whereas the lines were obtained from the 
convergent solution of Eqs. (31), (32), (37) and (38). 

In order to verify this supposition, we performed numerical experiments for 
the case of a binary output y = ±1. In the experiments, the conditional distribu- 
tion was obtained as P{y\x, w) = g{y\^) = Dz. In Fig. 2 (a), the symbols 

denote the trajectories of $R^t$ experimentally obtained from 100 experiments using Eqs. (25) and (26) for N = 100, p = 200 and a = 0.1, 0.5, 1.0, whereas the lines denote the theoretical predictions obtained using Eqs. (31), (32), (37) and (38). Figure 2 (b) plots the prediction error as a function of the ratio α = p/N, varying p while maintaining N = 100, when the Bayesian inference is performed based on Eq. (29) using the equilibrium solution. The excellent correspondence between the experimental results and theoretical predictions shown by these figures strongly supports our supposition.



4 Application to CDMA Multi-user Detection 

In the preceding section, we developed a BP-based approximation algorithm of computational cost $O(N^2)$ per update for the Bayesian inference in perceptron-type networks. Although this algorithm may be competitive for large networks, several of the assumptions adopted herein are somewhat artificial, so its significance for real-world problems is open to question. Therefore, we next present an application to the multi-user detection problem that arises in a code division multiple access (CDMA) scheme. Code division multiple access is a core technology of the current wireless communications system [10,11,23].

In the general scenario of a CDMA system, the binary signals of multiple 
users are modulated by spreading codes assigned to each user beforehand, and 
these modulated sequences are then transmitted to a base station. The base 
station (BS) receives a mixture of the modulated sequences and possible noise. 
Thereafter, a detector at the BS extracts the original binary signals from the 
received real signals using knowledge of the users’ spreading codes. 

Multiuser detection is a scheme used in the detection stage. By simultane- 
ously estimating multiple user signals following the Bayesian framework, this 
scheme suppresses mutual interference and can minimize the rate of detection 
error. However, as following the Bayesian framework is computationally difficult, 
the development of approximation algorithms is necessary for practical imple- 
mentation. 

The technique developed in the preceding section is a promising method for addressing this problem. In order to demonstrate the potential of this technique, we next examine an N-user direct-sequence binary phase shift-keying (DS/BPSK) CDMA using random binary spreading codes of spreading factor p with unit energy over an additive white Gaussian noise (AWGN) channel*. For simplicity, we assume that the signal powers are completely controlled to unit energy, while extension to the case of distributed power is straightforward. In addition, we assume that the chip timing, as well as the symbol timing, is perfectly synchronized among users. Under these assumptions [11], the received signal can be expressed as



$$y^\mu = \frac{1}{\sqrt{p}} \sum_{i=1}^{N} x_i^\mu w_i + \sigma n^\mu, \tag{39}$$

where μ = 1, 2, ..., p and i = 1, 2, ..., N are indices of samples and users, respectively, $x_i^\mu \in \{-1, 1\}$ is the spreading code independently generated from the identical uniform distribution $P(x_i^\mu = +1) = P(x_i^\mu = -1) = 1/2$, which yields zero mean and unit variance, $w_i \in \{+1, -1\}$ is the bit signal of user i, $n^\mu$ is a Gaussian noise sample with zero mean and unit variance, and σ is the standard deviation of AWGN. In the following, we assume that both N and p are sufficiently large, maintaining α = p/N finite, which may not be far from practical, since a relatively large spreading factor of up to p = 256 is possible in "cdma2000", one of the third-generation cellular phone systems [23].

* The notation used here differs from the conventional notation in order to emphasize the link to the perceptron learning problem argued in the previous section.

Fig. 3. Time evolution of BER for N = 1000 and p = 2000. □ and × represent the experimental results obtained from 10000 simulations for the developed algorithm and the multi-stage detection [24], whereas the lines represent the macroscopic equations corresponding to Eqs. (31), (32), (37) and (38).

The goal of multiuser detection is to simultaneously estimate the bit signals $w_1, w_2, \ldots, w_N$ after receiving the signals $y^1, y^2, \ldots, y^p$ when all components of the spreading codes $x_i^\mu$ are known. The Bayesian approach offers a useful framework for this case. Assuming that the bit signals are independently generated from the uniform distribution, the posterior distribution from the received signals is obtained as



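$$P(\mathbf{w}|D^p) = \frac{P(\mathbf{w}) \prod_{\mu=1}^{p} P(y^\mu|\mathbf{x}^\mu, \mathbf{w})}{\sum_{\mathbf{w}} P(\mathbf{w}) \prod_{\mu=1}^{p} P(y^\mu|\mathbf{x}^\mu, \mathbf{w})}, \tag{40}$$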



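where $D^p$ indicates the set of all received signals $y^\mu$ and spreading codes $x_i^\mu$, $P(\mathbf{w}) = 2^{-N} \prod_{i=1}^{N} [\delta(w_i - 1) + \delta(w_i + 1)]$ is the uniform prior distribution for the transmitted signals, $(x_1^\mu, x_2^\mu, \ldots, x_N^\mu)$ is denoted as $\mathbf{x}^\mu$ for each μ, $\Delta^\mu = \frac{1}{\sqrt{p}} \sum_{i=1}^{N} x_i^\mu w_i$, and

$$P(y^\mu|\mathbf{x}^\mu, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y^\mu - \Delta^\mu)^2}{2\sigma^2}\right] \tag{41}$$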



represents the AWGN channel. The Bayesian optimal estimator that minimizes 
the bit error rate (BER), which is the probability of occurrence of bitwise esti- 
mation error, is evaluated based on the posterior distribution (40) as 



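$$\hat{w}_i^{\mathrm{Bayes}} = \mathop{\mathrm{argmax}}_{w_i} \sum_{\mathbf{w} \setminus w_i} P(\mathbf{w}|D^p) = \mathrm{sgn}(\langle w_i \rangle), \tag{42}$$

where $\langle w_i \rangle$ is the posterior mean of $w_i$, and sgn(x) = 1 for x > 0, and −1 otherwise.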
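
For a handful of users, the Bayes-optimal estimator (42) can be evaluated by brute force, which makes the cost of the exact approach explicit: the sum in (40) runs over all $2^N$ bit patterns. The sketch below is illustrative only. It assumes the AWGN likelihood (41) with the signal normalized by $1/\sqrt{p}$ (unit-energy spreading codes), and the function name `bayes_optimal_detector` is hypothetical, not taken from the paper.

```python
import itertools
import numpy as np


def bayes_optimal_detector(X, y, sigma):
    """Brute-force Bayes-optimal multiuser detection.

    X     : (p, N) array of +/-1 spreading codes, one row per chip.
    y     : (p,) array of received signals.
    sigma : standard deviation of the AWGN.
    Returns the bitwise estimates sgn(<w_i>) and the posterior means <w_i>.
    """
    p, n_users = X.shape
    # Enumerate all 2^N candidate bit vectors (feasible only for small N).
    candidates = np.array(list(itertools.product([-1.0, 1.0], repeat=n_users)))
    # Noise-free signal of every candidate; assumed 1/sqrt(p) normalization.
    delta = candidates @ X.T / np.sqrt(p)
    # Log-likelihood of the AWGN channel; the uniform prior cancels out.
    log_post = -np.sum((y - delta) ** 2, axis=1) / (2.0 * sigma ** 2)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean_w = post @ candidates
    w_hat = np.where(mean_w > 0, 1.0, -1.0)
    return w_hat, mean_w


# Small noise-free example with mutually orthogonal codes (p = N = 4).
X_demo = np.array([[1, 1, 1, 1],
                   [1, -1, 1, -1],
                   [1, 1, -1, -1],
                   [1, -1, -1, 1]], dtype=float)
w_true = np.array([1.0, -1.0, 1.0, -1.0])
y_demo = X_demo @ w_true / np.sqrt(4)
w_hat_demo, mean_demo = bayes_optimal_detector(X_demo, y_demo, sigma=0.5)
```

Because the enumeration grows as $2^N$, this direct evaluation is practical only for very small systems, which is the motivation for the approximation algorithm discussed next.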

Note that this problem can be regarded as a Bayesian inference problem in a large perceptron, the parameters of which take only binary values $w_i = \pm 1$. Therefore, we can utilize the technique developed in the preceding sections for this problem by regarding Eq. (41) as $g(y^\mu|\Delta^\mu)$ and employing the uniform binary prior $P(w_i) = \frac{1}{2}(\delta(w_i - 1) + \delta(w_i + 1))$. This yields an approximation algorithm for the multi-user detection problem as

C = _ Q, - E ^ + =■<) . («) 

(Wi)‘ = tanh > (44) 

where the macroscopic variables $S^t$ and $E^t$ are evaluated as $S^t = 1 - Q^t$ and $E^t = \alpha(\alpha\sigma^2 + 1 - Q^t)^{-1}$, respectively [10]. Equations (43) and (44), which correspond respectively to Eqs. (25) and (26) of the previous section, can be performed in $O(N^2)$ computations per update when $\alpha = p/N \sim O(1)$.

Figure 3 shows the time evolution of BER obtained from Eqs. (43) and (44) 
in comparison with the multi-stage detection (MSD) algorithm, which is one 
of the conventional approximation schemes for the current problem [24]. This 
indicates that the proposed BP-based algorithm exhibits a considerably faster 
convergence rate than MSD, which is preferred for practical purposes. 

5 Summary 

In summary, we have developed an algorithm that approximates the Bayesian inference in a single-layered feed-forward network (perceptron) based on BP. Although the direct application of BP in perceptrons is not computationally tractable, we have shown that this difficulty can be overcome by using techniques from SM under certain assumptions, which leads to an iterative algorithm that can be performed in $O(N^2)$ computations per update when the numbers of examples and parameters, p and N, are of the same order. Examination of the properties of the algorithm for a simple teacher-student scenario indicated that a nearly optimal solution can be obtained using this algorithm for ideal situations. In order to demonstrate the practical significance of the proposed algorithm, we have presented an application to the CDMA multi-user detection problem that arises in a digital wireless communications system.

Extending the applicability of the proposed algorithm to multi-layered perceptrons remains a challenging task for future work.
Abstract. We present a framework for approximate inference in probabilistic data models which is based on free energies. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model. Consistency between distributions is required on a chosen set of moments. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes.
1 Introduction

Probabilistic data models explain the dependencies of complex observed data by a set of hidden variables and the joint probability distribution of all variables. The development of tractable approximations for statistical inference with these models is essential for realizing their full potential. Such approximations are necessary for models with a large number of variables, because the computation of the marginal distributions of hidden variables and the learning of model parameters require high dimensional summations or integrations.

The most popular approximation is the Variational Approximation (VA) [2], which replaces the true probability distribution by an optimized simpler one, where multivariate Gaussians or distributions factorizing in certain groups of variables [1] are possible choices. The neglect of correlations in factorizing distributions is of course a drawback. On the other hand, multivariate Gaussians allow for correlations but are restricted to continuous random variables which have the entire real space as their natural domain (otherwise, we get an infinite relative entropy, which is used as a measure for comparing exact and approximate densities in the VA). In this paper, we will discuss approximations which allow us to circumvent these drawbacks. These will be derived from a Gibbs Free Energy (GFE), an entropic quantity (originally developed in Statistical Physics) which allows us to formulate the statistical inference problem as an optimization problem. While the true GFE is usually not exactly tractable, certain approximations can give quite accurate results. We will focus on an expectation consistent (EC) approach which requires consistency between two complementary approximations (say, a factorizing or tree one with a Gaussian one) to the same probabilistic model.

The method is a generalization of the adaptive TAP approach (ADATAP) [16,15] developed for inference on densely connected graphical models, which has been applied successfully to a variety of relevant problems. These include Gaussian process models [17,14,10,11], probabilistic independent component analysis [6], the CDMA coding model in telecommunications [4], bootstrap methods for kernel machines [7,8], a model for wind field retrieval from satellite observations [3] and a sparse kernel approach [12]. For a different, but related approximation scheme see [10,9].

2 Approximative Inference 

Inference on the hidden variables $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ of a probabilistic model usually requires the computation of expectations, i.e., of certain sums or integrals involving a probability distribution with density

$$p(\mathbf{x}) = \frac{1}{Z} f(\mathbf{x}). \tag{1}$$

This density represents the posterior distribution of $\mathbf{x}$ conditioned on the observed data, the latter appearing as parameters in p. $Z = \int d\mathbf{x}\, f(\mathbf{x})$ is the normalizing partition function.

Although some results can be stated in fairly general form, we will mostly specialize to densities (with respect to the Lebesgue measure in $\mathbb{R}^N$) of the form

$$p(\mathbf{x}) = \frac{1}{Z} \prod_i \Psi_i(x_i) \exp\Big(\sum_{i<j} x_i J_{ij} x_j\Big), \tag{2}$$

where the $\Psi_i$'s are non-Gaussian functions. This also includes the important case of Ising variables $x_i = \pm 1$ by setting

$$\Psi_i(x_i) = \big(\delta(x_i + 1) + \delta(x_i - 1)\big)\, e^{\theta_i x_i}. \tag{3}$$



The type of density (2) appears as the posterior distribution for all models cited at the end of the Introduction.

$p(\mathbf{x})$ is a product of two functions, $p(\mathbf{x}) = \frac{1}{Z} f_1(\mathbf{x}) f_2(\mathbf{x})$, where both the factorizing part $f_1 = \prod_i \Psi_i(x_i)$ and the Gaussian part $f_2 = \exp\big[\sum_{i<j} x_i J_{ij} x_j\big]$ individually are simple enough to allow for exact computations. Hence, as an approximation, we might want to keep $f_1$ but replace $f_2$ by a function which also factorizes in the components $x_i$. As an alternative, one may keep $f_2$ but replace $f_1$ by a Gaussian function to make the whole distribution Gaussian. Neither choice is ideal. The first completely neglects correlations of the variables but leads to marginal distributions of the $x_i$ which may share non-Gaussian features (such as multimodality) with the true marginal. The second one neglects such features but incorporates nontrivial correlations. We will later develop an approach for combining these two approximations.



3 Gibbs Free Energies 



Gibbs free energies (GFE) provide a convenient formalism for dealing with probabilistic approximations. In this framework, the true, intractable distribution $p(\mathbf{x})$ is implicitly characterized as the solution of an optimization problem defined through the relative entropy (KL divergence)

$$KL(q, p) = \int d\mathbf{x}\, q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})} \tag{4}$$

between p and other trial distributions q. We consider a two-stage optimization process, where in the first step, the trial distributions q are constrained by fixing a set of values $\boldsymbol{\mu} = \langle \mathbf{g}(\mathbf{x}) \rangle_q$ for a set of generalized moments. The Gibbs Free Energy $G(\boldsymbol{\mu})$ is defined as

$$G(\boldsymbol{\mu}) = \min_q \left\{ KL(q, p) \mid \langle \mathbf{g}(\mathbf{x}) \rangle_q = \boldsymbol{\mu} \right\} - \ln Z, \tag{5}$$

where the term $\ln Z$ is subtracted to make the expression independent of the intractable partition function Z. In a second stage, both the true values $\boldsymbol{\mu} = \langle \mathbf{g}(\mathbf{x}) \rangle_p$ and the partition function Z are found by relaxing the constraints, i.e., by minimizing $G(\boldsymbol{\mu})$ with respect to $\boldsymbol{\mu}$:

$$\min_{\boldsymbol{\mu}} G(\boldsymbol{\mu}) = -\ln Z \quad \text{and} \quad \langle \mathbf{g} \rangle = \mathop{\mathrm{argmin}}_{\boldsymbol{\mu}} G(\boldsymbol{\mu}). \tag{6}$$

A variational upper bound to G is obtained by restricting the minimization in (5) to a subset of densities q.

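It can be easily shown that the optimizing distribution in (5) is of the form

$$q(\mathbf{x}) = \frac{f(\mathbf{x})}{Z(\boldsymbol{\lambda})} \exp\left(\boldsymbol{\lambda}^T \mathbf{g}(\mathbf{x})\right), \tag{7}$$

where the set of Lagrange parameters $\boldsymbol{\lambda} = \boldsymbol{\lambda}(\boldsymbol{\mu})$ is chosen such that the conditions $\langle \mathbf{g}(\mathbf{x}) \rangle_q = \boldsymbol{\mu}$ are fulfilled, i.e. $\boldsymbol{\lambda}$ satisfies

$$\frac{\partial \ln Z(\boldsymbol{\lambda})}{\partial \boldsymbol{\lambda}} = \boldsymbol{\mu}, \tag{8}$$

where $Z(\boldsymbol{\lambda})$ is a normalizing partition function.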
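Inserting the optimizing distribution eq. (7) into eq. (5) yields the dual representation of the Gibbs free energy

$$G(\boldsymbol{\mu}) = -\ln Z(\boldsymbol{\lambda}(\boldsymbol{\mu})) + \boldsymbol{\lambda}^T(\boldsymbol{\mu})\, \boldsymbol{\mu} = \max_{\boldsymbol{\lambda}} \left\{ -\ln Z(\boldsymbol{\lambda}) + \boldsymbol{\lambda}^T \boldsymbol{\mu} \right\}, \tag{9}$$

showing that G is the Legendre transform of $-\ln Z(\boldsymbol{\lambda})$, making G a convex function of its arguments.

We will later use the following simple result for the derivative of the GFE with respect to a parameter t contained in the probability density $p(\mathbf{x}|t) = f(\mathbf{x}, t)/Z_t$. This can be calculated using (9) and (8) as

$$\frac{dG_t(\boldsymbol{\mu})}{dt} = -\frac{\partial \ln Z(\boldsymbol{\lambda}, t)}{\partial t} + \left( \boldsymbol{\mu} - \frac{\partial \ln Z(\boldsymbol{\lambda}, t)}{\partial \boldsymbol{\lambda}} \right)^T \frac{d\boldsymbol{\lambda}}{dt} = -\frac{\partial \ln Z(\boldsymbol{\lambda}, t)}{\partial t}. \tag{10}$$

Hence, we can keep $\boldsymbol{\lambda}$ fixed upon differentiation.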



3.1 Simple Models 

We give results for Gibbs free energies of three tractable models and choices of 
moments (g(x)). These will be used later as building blocks for the free energies 
of more complicated models. 



Independent Ising variables. The Gibbs free energy for a set of independent Ising variables, each with a density of the form (3) and fixed first moments $\boldsymbol{\mu} = \langle \mathbf{x} \rangle = \mathbf{m} = (m_1, m_2, \ldots, m_N)$, is $G(\mathbf{m}) = \sum_i G_i(m_i)$, where

$$G_i(m_i) = \frac{1 + m_i}{2} \ln \frac{1 + m_i}{2} + \frac{1 - m_i}{2} \ln \frac{1 - m_i}{2} - \theta_i m_i. \tag{11}$$
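
As a sanity check, the dual representation (9) can be evaluated numerically for a single Ising variable and compared with the closed form of eq. (11). The sketch below is illustrative and rests on one assumption stated here explicitly: the single-site weight is $e^{\theta x}$ on $x = \pm 1$, so that $Z(\lambda) = 2\cosh(\theta + \lambda)$. The function names are hypothetical.

```python
import math


def log_partition(lam, theta):
    """ln Z(lambda) for one Ising variable x = +/-1 with weight exp(theta * x),
    tilted by the Lagrange term exp(lambda * x)."""
    return math.log(2.0 * math.cosh(theta + lam))


def gibbs_free_energy_dual(m, theta, tol=1e-12):
    """G(m) = max_lambda { -ln Z(lambda) + lambda * m }.

    The maximizer solves tanh(theta + lambda) = m, found here by bisection.
    """
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.tanh(theta + mid) < m:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return -log_partition(lam, theta) + lam * m


def gibbs_free_energy_closed_form(m, theta):
    """Closed-form single-node free energy: negative entropy minus theta * m."""
    p_up = 0.5 * (1.0 + m)
    p_down = 0.5 * (1.0 - m)
    return p_up * math.log(p_up) + p_down * math.log(p_down) - theta * m


g_dual = gibbs_free_energy_dual(0.3, 0.1)
g_closed = gibbs_free_energy_closed_form(0.3, 0.1)
```

Evaluating the closed form at the unconstrained mean $m = \tanh\theta$ returns $-\ln(2\cosh\theta)$, in line with the statement in (6) that the minimum of the Gibbs free energy equals $-\ln Z$.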



It will be useful to introduce a more complicated set of moments for this simple noninteracting model. We choose a tree graph $\mathcal{G}$ out of all possible sets of edges linking the variables $\mathbf{x}$ and fix the second moments $M_{ij} = \langle x_i x_j \rangle$ along these edges as constraints. In this case, it can be shown that the free energy is represented in terms of single- and two-node free energies

$$G\big(\mathbf{m}, \{M_{ij}\}_{(ij) \in \mathcal{G}}\big) = \sum_{(ij) \in \mathcal{G}} G_{ij}(m_i, m_j, M_{ij}) + \sum_i (1 - n_i)\, G_i(m_i), \tag{12}$$

where $G_{ij}(m_i, m_j, M_{ij})$ is the two-node free energy computed for a single pair of variables, and $n_i$ is the number of links to node i.



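Multivariate Gaussians. The Gaussian model is of the form (2) with $\Psi_i(x_i) \propto \exp\big[a_i x_i - \frac{b_i}{2} x_i^2\big]$. Here, we fix $\boldsymbol{\mu} = (\mathbf{m}, \mathbf{M})$, where $\mathbf{m}$ is the set of all first moments and $\mathbf{M}$ is an arbitrary subset of second moments $\langle x_i x_j \rangle = M_{ij} = M_{ji}$. We get

$$G(\mathbf{m}, \mathbf{M}) = -\frac{1}{2} \mathbf{m}^T \mathbf{J} \mathbf{m} - \mathbf{m}^T \mathbf{a} + \frac{1}{2} \sum_i M_{ii} b_i + \max_{\boldsymbol{\Lambda}} \left\{ \frac{1}{2} \ln\det(\boldsymbol{\Lambda} - \mathbf{J}) - \frac{1}{2} \mathrm{Tr}\, \boldsymbol{\Lambda}(\mathbf{M} - \mathbf{m}\mathbf{m}^T) \right\}, \tag{13}$$

where $\boldsymbol{\Lambda}$ is a matrix of Lagrange multipliers conjugate to the values of $\mathbf{M}$.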
3.2 Complex Models: A Perturbative Representation

We will now concentrate on the more complex model (2) together with a suitably chosen set of moments. We will represent the Gibbs free energy of this model as the GFE for a tractable "noninteracting" part $f_1 = \prod_i \Psi_i(x_i)$ plus a correction for the "interaction" term $f_2 = \exp\big[\sum_{i<j} x_i J_{ij} x_j\big]$. We fix as constraints all the first moments $\mathbf{m}$ and a subset of second moments $\mathbf{M}$, chosen in such a way that the Gibbs free energy for $f_1$ remains tractable. Different choices of second moments will later allow for more accurate approximations. If all second moments are fixed, our result will be exact, but for most models of the form (2) this leads again to intractable computations.

We define $f_2(\mathbf{x}, t)$ to be a smooth interpolation between the trivial case $f_2(\mathbf{x}, t = 0) = 1$ and the "full" intractable case $f_2(\mathbf{x}, t = 1) = f_2(\mathbf{x})$. For the model (2) we can set

$$f_2(\mathbf{x}, t) = \exp\Big[t \sum_{i<j} x_i J_{ij} x_j\Big].$$

Differentiating the Gibbs free energy with respect to t, using eq. (10), we get

$$G(\boldsymbol{\mu}, 1) - G(\boldsymbol{\mu}, 0) = -\int_0^1 dt \left\langle \frac{\partial \ln f_2(\mathbf{x}, t)}{\partial t} \right\rangle_{q(\mathbf{x}|t)}, \tag{14}$$

where $q(\mathbf{x}|t) = \frac{1}{Z_q(\boldsymbol{\lambda}, t)} f_1(\mathbf{x}) f_2(\mathbf{x}, t) \exp\big(\boldsymbol{\lambda}^T \mathbf{g}(\mathbf{x})\big)$. For the model (2) this can be written as

$$G(\boldsymbol{\mu}) = G(\boldsymbol{\mu}, 1) = G(\boldsymbol{\mu}, 0) - \int_0^1 dt \sum_{i<j} J_{ij} \langle x_i x_j \rangle_{q(\mathbf{x}|t)}. \tag{15}$$



4 Approximations to the Free Energy 

4.1 Mean Field Approximation 

If we restrict ourselves to fixed diagonal second moments $M_{ii}$ only, the simplest approximation is obtained by replacing the expectation over $q(\mathbf{x}|t)$ by the expectation over the factorizing distribution $q(\mathbf{x}|0)$, giving

$$G(\boldsymbol{\mu}) \approx G(\boldsymbol{\mu}, 0) - \sum_{i<j} J_{ij} m_i m_j. \tag{16}$$

This result is equivalent to the variational mean field approximation, obtained by restricting the minimization in (5) to densities of the form $q(\mathbf{x}|0)$. Hence, it gives an upper bound to the true GFE.
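
For Ising variables, setting the derivative of the mean field free energy (16) with respect to $m_i$ to zero gives the familiar self-consistency condition $m_i = \tanh(\theta_i + \sum_j J_{ij} m_j)$, provided the single-node term has the form of eq. (11) with external fields $\theta_i$ (an assumption made for this illustration). The sketch below iterates this condition with damping and compares the result with exact magnetizations from enumeration on a three-spin example. The function names, the damping factor and the example couplings are illustrative choices, not part of the paper.

```python
import itertools
import numpy as np


def mean_field_magnetizations(J, theta, n_iter=500, damping=0.5):
    """Damped fixed-point iteration of m_i = tanh(theta_i + sum_j J_ij m_j).

    J must be symmetric with zero diagonal.
    """
    m = np.zeros(len(theta))
    for _ in range(n_iter):
        m = damping * m + (1.0 - damping) * np.tanh(theta + J @ m)
    return m


def exact_magnetizations(J, theta):
    """Exact <x_i> by enumerating all 2^N Ising states (small N only)."""
    n = len(theta)
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    # sum_{i<j} x_i J_ij x_j equals 0.5 * x^T J x for symmetric J, zero diagonal.
    log_w = states @ theta + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    w = np.exp(log_w - log_w.max())
    return (w / w.sum()) @ states


# Three spins with uniform weak couplings and fields (illustrative values).
J_demo = 0.1 * (np.ones((3, 3)) - np.eye(3))
theta_demo = np.full(3, 0.1)
m_mf = mean_field_magnetizations(J_demo, theta_demo)
m_exact = exact_magnetizations(J_demo, theta_demo)
```

For weak couplings such as these, the mean field magnetizations are close to the exact ones. The discrepancy grows with the coupling strength, which is the regime the approximations in the following sections aim to improve.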
4.2 Perturbative Expansion

One can improve on the mean field result by turning the exact expression (15) 
into a series expansion of the free energy in powers of t, setting t = 1 at the end. 
It is easy to see that the term linear in t corresponds to the mean field result. The 
second order term of this so-called Plefka expansion can be found in [18], see also 
several contributions in [13]. While the second order term seems to be sufficient 
for models with random independent couplings in a “thermodynamic limit” , 
more advanced approximations are necessary in general [16,15]. 

4.3 A Lower Bound to the Gibbs Free Energy for Ising Variables 

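Such a bound was recently found by Wainwright & Jordan [20,19] and can be obtained by specifying all second moments $\mathbf{M}$. Then it is easy to see from the definition of the free energy that

$$G(\boldsymbol{\mu}) + \sum_{i<j} J_{ij} M_{ij} + \sum_i \theta_i m_i = -H[\mathbf{x}],$$

where $H[\mathbf{x}]$ is the (discrete) entropy of the random variable $\mathbf{x}$. Wainwright and Jordan construct a continuous random variable $\tilde{\mathbf{x}}$ (a noisy version of $\mathbf{x}$) which has the same differential entropy $h[\tilde{\mathbf{x}}] = H[\mathbf{x}]$. Now they can apply a maximum-entropy argument and upper bound $h[\tilde{\mathbf{x}}]$ by the differential entropy $h_{\mathrm{Gauss}}[\tilde{\mathbf{x}}]$ of a Gaussian with the same moments:

$$-H[\mathbf{x}] = -h[\tilde{\mathbf{x}}] \geq -h_{\mathrm{Gauss}}[\tilde{\mathbf{x}}] = -\frac{1}{2} \log\det[\mathrm{Cov}(\tilde{\mathbf{x}})] - \frac{N}{2} \log(2\pi e) = -\frac{1}{2} \log\det\Big[\frac{1}{4}\mathrm{Cov}(\mathbf{x}) + \frac{1}{12} \mathbf{I}_N\Big] - \frac{N}{2} \log(2\pi e). \tag{17}$$

The approximate free energy turns out to be a convex function of its arguments.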

4.4 Bethe Kikuchi Type of Approximations 

These are usually applied to discrete random variables and become exact if the graph which is defined by the edges of nonzero couplings $J_{ij} \neq 0$ is a tree or (for the Kikuchi approximation) a more generalized cluster of nodes. For tree
connected graphs, the joint density of variables can always be expressed through 
single and two node marginals (similar to (12)). Using this structure within the 
optimization (5), one can calculate the Gibbs free energy exactly and efficiently 
when all first moments and the second moments along the edges of the graph 
are fixed. The approximation [21,22,5] is obtained when the graph of nonzero 
couplings is not a tree, but the simple form of the tree type distribution is 
still used in the optimization (5). Although the variation is over a subset of 
distributions, the Bethe-Kikuchi approximations do not lead to an upper bound 
to the free energy. This is because the constraints are no longer along trees and 
are thus not consistent with the distribution assumed. 
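5 Expectation Consistent Approximations

Our goal is to come up with another approximation which improves over the mean field result (16) by making a better approximation to $q(\mathbf{x}|t)$ in (15). We will use our assumption that we may approximate the density (2) by alternatively discarding the factor $f_1(\mathbf{x})$ as intractable, replacing the density $q(\mathbf{x}|t)$ by

$$r(\mathbf{x}|t) = \frac{1}{Z_r(\boldsymbol{\lambda}, t)} f_2(\mathbf{x}, t) \exp\left(\boldsymbol{\lambda}^T \mathbf{g}(\mathbf{x})\right), \tag{18}$$

where the parameters $\boldsymbol{\lambda}$ are chosen to have consistency for the expectations of $\mathbf{g}$, i.e. $\langle \mathbf{g}(\mathbf{x}) \rangle_{r(\mathbf{x}|t)} = \boldsymbol{\mu}$.

$r(\mathbf{x}|t)$ defines another Gibbs free energy with a dual representation eq. (9)

$$G_r(\boldsymbol{\mu}, t) = \max_{\boldsymbol{\lambda}} \left\{ -\ln Z_r(\boldsymbol{\lambda}, t) + \boldsymbol{\lambda}^T \boldsymbol{\mu} \right\}. \tag{19}$$

We will use $r(\mathbf{x}|t)$ to treat the integral in eq. (14), writing

$$\int_0^1 dt \left\langle \frac{\partial \ln f_2(\mathbf{x}, t)}{\partial t} \right\rangle_{q(\mathbf{x}|t)} \approx \int_0^1 dt \left\langle \frac{\partial \ln f_2(\mathbf{x}, t)}{\partial t} \right\rangle_{r(\mathbf{x}|t)}. \tag{20}$$

Using the relation eq. (10) for the free energy eq. (19) we get

$$-\int_0^1 dt \left\langle \frac{\partial \ln f_2(\mathbf{x}, t)}{\partial t} \right\rangle_{r(\mathbf{x}|t)} = G_r(\boldsymbol{\mu}, 1) - G_r(\boldsymbol{\mu}, 0) \tag{21}$$

and arrive at the expectation consistent (EC) approximation:

$$G(\boldsymbol{\mu}) \approx G(\boldsymbol{\mu}, 0) + G_r(\boldsymbol{\mu}, 1) - G_r(\boldsymbol{\mu}, 0) \equiv G^{EC}(\boldsymbol{\mu}). \tag{22}$$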



6 Results for Ising Variables 

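We will now apply our EC framework to the model (2) with Ising variables $x_i = \pm 1$. We will discuss two types of approximations which differ by the set of fixed second moments $M_{ij}$. By fixing more and more second moments, we reduce the number of interaction terms of the form $J_{ij} x_i x_j$ which are not fixed and have to be approximated.

Since $r(\mathbf{x}|t)$ is a multivariate Gaussian, we have

$$G^{EC}(\mathbf{m}, \mathbf{M}) = G(\mathbf{m}, \mathbf{M}, 0) - \frac{1}{2} \mathbf{m}^T \mathbf{J} \mathbf{m} + \max_{\boldsymbol{\Lambda}} \left\{ \frac{1}{2} \ln\det(\boldsymbol{\Lambda} - \mathbf{J}) - \frac{1}{2} \mathrm{Tr}\, \boldsymbol{\Lambda}(\mathbf{M} - \mathbf{m}\mathbf{m}^T) \right\} - \max_{\boldsymbol{\Lambda}} \left\{ \frac{1}{2} \ln\det \boldsymbol{\Lambda} - \frac{1}{2} \mathrm{Tr}\, \boldsymbol{\Lambda}(\mathbf{M} - \mathbf{m}\mathbf{m}^T) \right\}. \tag{23}$$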
To obtain estimates for the second moments which are not fixed in the free energy, we take derivatives of the free energy with respect to the coupling parameters $J_{ij}$, yielding

$$\langle \mathbf{x}\mathbf{x}^T \rangle - \langle \mathbf{x} \rangle \langle \mathbf{x}^T \rangle = (\boldsymbol{\Lambda} - \mathbf{J})^{-1}. \tag{24}$$

This result is also consistent with the fixed values $\mathbf{M}$ for the second moments.

6.1 Diagonal Approximation 

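When we fix only the trivial diagonal second moments $M_{ii} = \langle x_i^2 \rangle = 1$ (Ising constraints), $\mathbf{M}$ does not appear as a variable in the free energy. The EC approximation eq. (23) is then given by

$$G^{EC}(\mathbf{m}) = G^{\mathrm{Is}}(\mathbf{m}) - \frac{1}{2} \mathbf{m}^T \mathbf{J} \mathbf{m} + \max_{\boldsymbol{\Lambda}} \left\{ \frac{1}{2} \ln\det(\boldsymbol{\Lambda} - \mathbf{J}) - \frac{1}{2} \sum_{i=1}^{N} \Lambda_i (1 - m_i^2) \right\} + \frac{1}{2} \sum_{i=1}^{N} \ln(1 - m_i^2) + \frac{N}{2}, \tag{25}$$

where $G^{\mathrm{Is}}(\mathbf{m})$, the free energy of independent Ising variables, is given by eq. (11) and $\boldsymbol{\Lambda}$ is a diagonal matrix of Lagrange parameters. This result coincides with the older adaptive TAP approximation [16,15].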

6.2 Tree Approximation 

A more complex, but still tractable approximation is obtained by selecting an arbitrary tree-connected subgraph of pairs of nodes and fixing the second moments of the Ising variables along the edges of this graph. The free energy is again of the form eq. (23), but now with $G(\mathbf{m}, \mathbf{M}, 0)$ given by eq. (12) and with the Lagrange parameters $\Lambda_{ij}$ restricted to be non-zero on the tree graph only. If the tree is chosen in such a way as to include the most important couplings (defined in a proper way), one can expect that the approximation will improve significantly over the diagonal case.

7 Simulations 

We compare results of the EC approximation with those of the Bethe-Kikuchi approaches on a toy problem suggested in [5]. We use N = 10 nodes and constant "external fields" $\theta_i = \theta = 0.1$. The $J_{ij}$'s are drawn independently at random, setting $J_{ij} = \beta w_{ij}/\sqrt{N}$, with Gaussian $w_{ij}$'s of zero mean and unit variance. We study eight different scaling factors β = [0.10, 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 10.00]. The results are summarized in Figures 1 and 2. Figure 1 gives the maximum absolute deviation (MAD) of our results from the exact marginals for different scaling parameters. We consider the one-variable marginals $p(x_i) = \frac{1 + x_i m_i}{2}$ and the two-variable marginals $p(x_i, x_j) = \frac{x_i x_j C_{ij}}{4} + p(x_i) p(x_j)$ with the approximate covariance $C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle$ given by eq. (24). Figure 2 gives estimates for the free energy. The results show that the simple diagonal EC approach gives performance similar to (and in many cases better than) the more structured Bethe and Kikuchi approximations.

Fig. 1. Maximal absolute deviation (MAD) for one- (left) and two-variable (right) marginals. Blue upper full line: EC factorized, blue lower full line: EC tree, green dashed line: Bethe and red dash-dotted line: Kikuchi.

Fig. 2. Left plot: free energy for EC factorized and tree (blue full line), Bethe (green dashed line), Kikuchi (red dash-dotted) and exact (stars). Right: Absolute deviation (AD) for the three approximations, same line color and type as above. Lower full line is for the tree EC approximation.
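
The marginals used in this comparison follow directly from the moments of Ising variables. The sketch below builds the one- and two-variable marginal tables from a magnetization $m_i$ and a covariance $C_{ij}$, and computes a maximum absolute deviation between two tables. It is an illustration only: the function names and the numerical values are hypothetical and are not the results reported in the figures.

```python
import numpy as np

SPIN_VALUES = np.array([-1.0, 1.0])


def one_variable_marginal(m_i):
    """Table of p(x_i) = (1 + x_i * m_i) / 2 for x_i in (-1, +1)."""
    return 0.5 * (1.0 + SPIN_VALUES * m_i)


def two_variable_marginal(m_i, m_j, c_ij):
    """Table of p(x_i, x_j) = x_i * x_j * C_ij / 4 + p(x_i) * p(x_j).

    Rows index x_i and columns index x_j, both in the order (-1, +1).
    """
    product_term = np.outer(one_variable_marginal(m_i), one_variable_marginal(m_j))
    correlation_term = 0.25 * c_ij * np.outer(SPIN_VALUES, SPIN_VALUES)
    return correlation_term + product_term


def max_abs_deviation(table_a, table_b):
    """Maximum absolute deviation between two marginal tables."""
    return float(np.max(np.abs(np.asarray(table_a) - np.asarray(table_b))))


# Illustrative moments (not taken from the experiments).
pair_table = two_variable_marginal(0.2, -0.4, 0.1)
deviation = max_abs_deviation(pair_table, two_variable_marginal(0.2, -0.4, 0.0))
```

Summing `pair_table` over either index recovers the corresponding one-variable marginal, since the correlation term sums to zero along each row and column.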

For the EC tree approximation, we construct a spanning tree of edges by choosing as the next edge the (so far unlinked) pair of nodes with the strongest absolute coupling $|J_{ij}|$ that will not cause a loop in the graph. The EC tree
version is almost always better than the other approximations. A comparison 
with the Wainwright-Jordan approximation (corresponding to (17)) and details 
of the algorithm will be given elsewhere. 
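
The greedy tree construction described above can be sketched as a Kruskal-style maximum spanning tree on the absolute couplings. This is one plausible reading of the description, not the authors' implementation (whose details are deferred to a later publication). The function name `greedy_spanning_tree` and the example matrix are hypothetical.

```python
import numpy as np


def greedy_spanning_tree(J):
    """Select N - 1 edges greedily by decreasing |J_ij|, skipping edges that
    would close a loop (Kruskal's algorithm with a union-find structure)."""
    n = J.shape[0]
    parent = list(range(n))

    def find(i):
        # Find the representative of i's component, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs i < j, strongest absolute coupling first.
    edges = sorted(
        ((abs(J[i, j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding (i, j) does not create a loop
            parent[root_i] = root_j
            tree.append((i, j))
            if len(tree) == n - 1:
                break
    return tree


# Small symmetric example with distinct coupling strengths.
J_example = np.zeros((4, 4))
for (a, b), value in {(0, 1): 0.9, (1, 2): -0.8, (0, 2): 0.7,
                      (0, 3): -0.2, (2, 3): 0.1, (1, 3): 0.05}.items():
    J_example[a, b] = J_example[b, a] = value
tree_edges = greedy_spanning_tree(J_example)
```

In this example the edge (0, 2) is skipped despite its strong coupling because nodes 0 and 2 are already connected through node 1.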
8 Outlook

We have introduced a scheme for approximate inference with probabilistic mod- 
els. It is based on a free energy expansion around an exactly tractable substruc- 
ture (like a tree) where the remaining interactions are treated in a Gaussian 
approximation thereby retaining nontrivial correlations. In the future, we plan 
to combine our method with a perturbative approach which may allow for a sys- 
tematic improvement together with an estimate of the error involved. We will 
also work on an extension of our framework to more complex types of proba- 
bilistic models beyond the pairwise interaction case. 